PREFACE


Interest in the field of parallel processing continues to climb. This trend is evidenced by the sharp increase in 
papers submitted to the International Conference on Parallel Processing during recent years: 


        Papers      Papers
Year    Submitted   Accepted   Percent
1980    170         65         57
1983    240         136        57
1986    400         170        43
1987    487         174        36
1988    590         173        29


Although the number of submissions continues to increase, the number of accepted papers this year and in the 
past two years has remained relatively unchanged. This is due to the limitation imposed by the fixed number 
of hours available for the conference. As a result, a record number of papers had to be rejected. This year, 
the conference proceedings are being published in three volumes, organized by subject category. The 
breakdown of submissions and acceptances in the three main categories of this conference is as follows: 


                               Papers      Papers
Category                       Submitted   Accepted   Percent
Architecture                   264         74         28
Software                       144         43         30
Algorithms and Applications    182         56         31


Of the 173 papers that were accepted, 79 were accepted as regular papers and 94 were accepted as short 
papers. Many papers that normally would have been accepted as long papers were accepted as short papers in 
order to meet the maximum number of paper-sessions allotted for the conference. 


Finding sufficient numbers of qualified reviewers to evaluate the record number of submissions this year was a 
particularly challenging task. Over 1,000 professionals in the field participated in this process. This year the 
process of selecting referees was simplified by the use of questionnaires, which were mailed to previous 
participants in the conference. The information on the completed questionnaires was entered into databases, 
which then allowed the conference chairmen to select reviewers qualified in fairly specialized fields. Even so, 
numerous papers were so highly specialized that custom selection of referees was still required. It appears 
that an even more detailed breakdown of specializations will be needed for these questionnaires in the future. 
Greater effort will also be required in the future to find additional reviewers to adequately evaluate the 
increasing numbers of submissions.


I am grateful to Sun Microsystems, Inc. for its support and, in particular, to Wayne Rosing (Vice President of 
Advanced Development) for giving me the opportunity to co-chair the ICPP88 program. I am very grateful to 
the reviewers for their timely and thorough evaluation, and the many other persons who assisted in the 
program effort this year. Many thanks are due to the administrators, Janice Barnes and Marianne Witkop, for 
helping to make the proceedings a reality. In particular, I would like to express my appreciation to Alex Kwok, 
Michel Cekleov and Roland Lee who assisted in selecting referees and in handling the correspondence. Special 
thanks are due to Alex Kwok for developing user-friendly author and referee databases, and for automating 
the generation of correspondence for handling the papers. Finally, I wish to thank Prof. Tse-yun Feng for his 
guidance, support and encouragement in this effort. 


Fayé A. Briggs 
1988 Program Co-chair 


Sun Microsystems, Inc. 
Mountain View, CA 94043 
ABSTRACT


In the design of multicomputer systems, the scheduling 
and mapping of a parallel algorithm onto a host architecture 
has a critical impact on overall system performance. In this
paper we develop a graph-based solution to both aspects of
the mapping problem using the simulated annealing
optimization heuristic. A two-phase mapping strategy is
formulated: 1) process annealing assigns parallel processes to
processing nodes, and 2) connection annealing schedules traf- 
fic connections on network data links so that interprocess 
communication conflicts are minimized. To evaluate the 
quality of generated mappings, cost functions suitable for 
simulated annealing are derived that accurately quantify 
communication overhead. Application examples are pre- 
sented using the hypercube as a host architecture, with host 
graphs containing up to 512 nodes. 


I. INTRODUCTION


Multicomputers are a form of parallel processing system 
composed of many processing elements (PE’s), each with its
own local memory. Individual PE’s are connected to other 
PE’s by point-to-point links that allow the bidirectional 
transfer of data. The cost of connecting every processor to 
every other processor is typically prohibitive, so links connect 
only selected processors, forming an interconnection
topology such as a mesh, tree, or binary hypercube. Each 
processor executes a task or process. Local references by a 
process are efficient, since the PE contains local memory, but 
communication with processes executing on other PE’s can 
significantly limit system throughput if data must be trans- 
ferred over many links or if links are congested due to exces- 
sive traffic. To realize the full potential of a multicomputer’s 
capabilities, it is essential that the distance between
communicating processes be minimized and that link traffic
be minimized to reduce delay.


The mapping problem maps an image architecture, a set 
of processes and their communication requirements, onto a 
multicomputer or host architecture. The problem consists 
of two components: 1) assignment of processes to processors, 
and 2) assignment or scheduling of interprocess communication
traffic over network links. This paper presents a new
approach to processor and link assignment in multicomputers
based on the simulated annealing heuristic. The procedure
has been implemented for binary hypercube host
architectures. Results indicate that the technique produces 
good, and often optimal, mappings within reasonable com- 
putation times. 


The paper first discusses the mapping problem, recent 
research, and communication overhead cost functions that 
can be used in an objective function. Simulated annealing is 
then applied to the mapping problem to find processor and 
link assignments that minimize the objective function for 
given host and image architectures. The final section pre- 
sents results for mapping two image architectures, each con- 
taining up to 512 processes, onto a binary hypercube 
multicomputer. 


II. THE MAPPING PROBLEM


The assignment and scheduling problem concerns the 
mapping of an arbitrary image architecture onto a general- 
purpose host or target architecture in a manner that mini- 
mizes communication conflicts among concurrent processes. 
For our purposes, the image architecture consists of a set of
synchronous, static processes with communication
requirements known prior to run time. The host architecture
describes a point-to-point multiprocessor network with a fixed
interconnection topology. To evaluate the quality of an
assignment, an objective function is used to quantify the
communication cost. The behavior of the mapping algorithm
and the quality of generated assignments depend on the ob- 
jective function chosen. 


A. Host Architecture 


The host architecture is represented by a host graph that
describes the interconnection of processors in a
multiprocessor network. The host graph is denoted by the
undirected graph $G_H = \langle V_H, E_H \rangle$, where $V_H$
is a set of processors and $E_H$ is a set of edges describing
the communication paths between processors. Every vertex in
$V_H$ corresponds to a distinct processing element, referred
to as a node. Every edge $(n_j, n_k) \in E_H$ corresponds to a
bidirectional data link between nodes $n_j$ and $n_k$. This
implies that a single physical network link exists between
each pair of directly connected processors. The host graph is
assumed to be a connected graph, disallowing the possibility
of isolated processor nodes. The following terminology is
used when referring to the host architecture.


$N$             The number of nodes in the host network,
                $N = |V_H|$.

$n_j$           A node $j$ in the network, $1 \le j \le N$.

$l_{jk}$        A link permitting communication from node $n_j$
                to node $n_k$. Note that $l_{jk}$ and $l_{kj}$ are
                equivalent.

$t_{jk}$        The amount of communication traffic, in packets
                or units of traffic, flowing from node $n_j$ to
                $n_k$.

$d(n_j, n_k)$   The distance between two nodes $n_j$ and $n_k$, or
                the minimum number of links forming a path
                between the nodes. Several paths of length
                $d(n_j, n_k)$ may exist between the nodes.


B. Image Architecture 


The image architecture to be mapped onto the host network
is represented by an image graph that describes the
communication dependencies between concurrent processes.
The image graph is a directed graph
$G_I = \langle V_I, E_I, W_I \rangle$. Every vertex in $V_I$
corresponds to an individual process. Every edge
$(p_j, p_k) \in E_I$ corresponds to a one-way data connection
between processes $p_j$ and $p_k$. This does not impose
any limitations on process communication, as a mutual data
dependency may be represented as two opposing directed
edges. The weight of an image edge $w_{jk} \in W_I$ represents
the expected traffic from $p_j$ to $p_k$. The following
terminology is used when referring to the image architecture.


$P$             The number of processes to be mapped,
                $P = |V_I|$.

$p_j$           A process $j$ in the image architecture,
                $1 \le j \le P$.

$w_{jk}$        The communication requirement in packets or units
                of traffic between processes $p_j$ and $p_k$.

$c_{jk}$        The effective communication cost of a connection
                between processes $p_j$ and $p_k$.

$f_j$           $f_j \in V_H$ such that the mapping function
                $f: V_I \to V_H$ assigns process $p_j$ to node
                $f_j$.

$d(p_j, p_k)$   The minimum number of links forming a path
                between the nodes which execute processes $p_j$
                and $p_k$, i.e. $d(f_j, f_k)$. Several paths of
                length $d(p_j, p_k)$ may exist between the
                processes.


An abbreviated form of the distance function, $d_{jk}$, is used
where the meaning is apparent from context. The term $d_{jk}$
may be interpreted as either $d(n_j, n_k)$ or $d(p_j, p_k)$
when appropriate. Communication weights are integer numbers,
and traffic between two processes is indivisible; the traffic
may not be split and routed along different network paths.
Ordinarily, the number of processes $P$ is equal to the number
of available nodes $N$ to maximize the use of processor
resources. If $P$ is less than $N$, however, $N - P$ dummy
processes may be added to the image graph. To accommodate the
possibility of isolated processes, the image graph $G_I$ may
be unconnected. The case $P > N$ poses load sharing problems
which are not considered in this paper.


C. Prior Work 


One evaluation criterion commonly implemented in
optimization algorithms is the objective function found in the
quadratic assignment problem [1]. Cast in terms of the
mapping problem, the problem may be stated as follows. A
set of $P$ processes has associated with it a communication
traffic intensity $w_{jk}$ between each pair of processes $p_j$
and $p_k$. A set of $N$ processor nodes is configured with a
distance or delay $d(n_j, n_k)$ between nodes $n_j$ and $n_k$.
Then the communication overhead between two processes $j$ and
$k$ is the product of $w_{jk}$ and $d(f_j, f_k)$, and the
optimal mapping $f$ minimizes

$$\sum_{j,k} w_{jk} \, d(f_j, f_k).$$


This objective function treats the communication traffic be- 
tween every pair of processes as if it is independent of all 
other processes, which is only true if nodes communicate 
along dedicated network links. Thus it does not accurately 
characterize the high local traffic densities and communi- 
cation bottlenecks that may arise among concurrent proc- 
esses. 
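
As a concrete illustration of the quadratic assignment objective, the following minimal Python sketch evaluates the sum for a binary hypercube host, where the distance between two nodes is the Hamming distance of their binary labels. The function names, the dictionary layout of the weights, and the sample ring of four processes are assumptions made for illustration only.

```python
def hypercube_distance(a: int, b: int) -> int:
    """Hamming distance between node labels = minimum number of links."""
    return bin(a ^ b).count("1")


def qap_cost(weights: dict, mapping: list) -> int:
    """Sum of w_jk * d(f_j, f_k) over all directed image edges.

    weights: {(j, k): w_jk} for each image edge (hypothetical layout).
    mapping: mapping[j] is the host node f_j assigned to process j.
    """
    return sum(w * hypercube_distance(mapping[j], mapping[k])
               for (j, k), w in weights.items())


# Four processes communicating in a ring, mapped onto a 2-dimensional hypercube.
ring = {(0, 1): 3, (1, 2): 1, (2, 3): 3, (3, 0): 1}
good = [0, 1, 3, 2]   # every ring edge lands on a single host link
poor = [0, 3, 1, 2]   # the heavy edges 0->1 and 2->3 each span two links

print(qap_cost(ring, good), qap_cost(ring, poor))
```

The sketch scores each connection in isolation, which is exactly the limitation noted above: two connections sharing a link contribute no extra cost.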


A mapping strategy using the cardinality of the mapping 
for the objective function was investigated by Bokhari [2]. 
Using cardinality as a measure of assignment quality, the 
objective is to maximize the number of pairs of communi- 
cating processes that fall on pairs of directly connected 
processors, thereby maximizing the number of image edges 
that map to host edges. This strategy fails to account for the 
significant effect that unmatched pairs of edges can have on 
the total communication overhead. Also, the algorithm as- 
signs a uniform traffic intensity to all pairs of communicating 
processes, which limits its application. Bokhari states that a 
mapping algorithm using cardinality as an objective function 
exhibits behavior very similar to the quadratic assignment 
problem. 


Bianchini and Shen [3]-[4] describe a method to auto- 
matically assign interprocessor communication in special 
purpose multiple processor systems, e.g. digital signal proc- 
essing systems. The objective function used in the algorithm 
determines a communication cost based on the utility of 
network links, where utility is defined as the fraction of link 
capacity utilized by traffic. They do not consider the issue 
of process assignment; the traffic scheduler accepts a fixed 
placement of processes in the host architecture and then 
generates an optimal communication schedule for that par- 
ticular assignment. This is acceptable when considering only 
dedicated heterogeneous architectures, where there may be 
little opportunity for optimizing the assignment of the image 
architecture to the processor nodes. For general-purpose 
homogeneous architectures, however, the assignment of 
processes has a substantial impact on the quality of the final 
traffic schedule and the overall system throughput. 


To overcome the inadequacies of traditional objective 
functions, Lee and Aggarwal [5] formulate a set of new ob- 
jective functions that accurately quantify communication 
overhead. Their functions measure the optimality of a map- 
ping for general applications by considering the communi- 
cation cost of all image edges along with the overall mode 
of communication, synchronous or asynchronous. This allows
realistic evaluation of the network contention that occurs
when concurrent processes compete for communication
resources. Lee and Aggarwal also describe an efficient
mapping strategy developed for the objective functions. While
the mapping strategy addresses the problem of optimal 
process assignment, it utilizes a fixed routing scheme for 
traffic scheduling. Such a mapping scheme does not consider 
the possibility of exploiting the routing rules of a network to 
optimize the assignment of the image connections to network 
data paths. 


D. Objective and Cost Functions 


The objective function determines the performance 
characteristics of the mapping algorithm by specifying an 
appropriate optimization goal. To provide a realistic
evaluation of the total communication overhead, the traffic
intensity of each weighted image connection must be considered.


Cost Functions: The communication cost $c_{jk}$ of an image
connection between $p_j$ and $p_k$ is a function of the weight
or traffic intensity of the corresponding edge in the image
graph. If the connection is routed along dedicated network
links, the communication cost C1 may be represented as

$$\mathrm{C1}: \quad c_{jk} = w_{jk} \, d(f_j, f_k).$$

In this case the cost of the connection is determined by the
distance separating the nodes onto which $p_j$ and $p_k$ are
mapped.


In general, a connection is established along network 
links that are shared by a number of different processes. 
Some links may be used by several processes, and communi- 
cation along the connection will experience delays due to link 
sharing. The delay encountered at a network link is propor- 
tional to the total traffic intensity supported by that link. 
To quantify the effect of the overall delay on the cost of the 
connection, the delay at every link must be taken into ac- 
count. Some additional definitions are needed. 


$L_i$           Link number $i$ ($1 \le i \le d_{jk}$) in the
                connection between $p_j$ and $p_k$ under
                consideration.

$D_i$           The amount of delay at link $L_i$.

$U_{st}(L_i)$   $U_{st}(L_i) = 1$ if the connection between
                processes $p_s$ and $p_t$ is routed along link
                $L_i$; $U_{st}(L_i) = 0$ otherwise.


Then the delay at each link in the connection is represented
by

$$D_i = \sum_{s,t} w_{st} \, U_{st}(L_i),$$

and the cost of a connection between $p_j$ and $p_k$ by

$$\mathrm{C2}: \quad c_{jk} = \sum_{i=1}^{d_{jk}} D_i.$$

If subscripts $j$ and $k$ are interpreted to mean the nodes
$n_j$ and $n_k$ connected by link $L_i$, then the above
expression for $D_i$ is equivalent to $t_{jk}$, the traffic
intensity of link $l_{jk}$. Therefore the delay or
communication cost of a network link is proportional to the
total traffic routed along the link.
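
To make cost function C2 concrete, the following Python sketch first accumulates the traffic carried by each link and then sums those link delays along every connection's path. The data layout (connections keyed by source and destination node, routes given as explicit node lists) and all names are assumptions introduced for illustration. Links are treated as bidirectional, so traffic in either direction loads the same link.

```python
from collections import defaultdict


def link_key(a: int, b: int) -> tuple:
    """Links are bidirectional, so (a, b) and (b, a) name the same link."""
    return (a, b) if a < b else (b, a)


def link_delays(weights: dict, routes: dict) -> dict:
    """D for every link: total weight of all connections routed over it."""
    delay = defaultdict(int)
    for conn, path in routes.items():
        for a, b in zip(path, path[1:]):
            delay[link_key(a, b)] += weights[conn]
    return dict(delay)


def c2_costs(weights: dict, routes: dict) -> dict:
    """c_jk under C2: sum of the delays of the links on the connection's path."""
    delay = link_delays(weights, routes)
    return {conn: sum(delay[link_key(a, b)] for a, b in zip(path, path[1:]))
            for conn, path in routes.items()}


# Hypothetical example on a 2-dimensional hypercube (nodes 0..3).
weights = {(0, 3): 2, (0, 1): 1, (1, 3): 4}
routes = {(0, 3): [0, 1, 3], (0, 1): [0, 1], (1, 3): [1, 3]}
costs = c2_costs(weights, routes)
print(costs)
```

In this example the connection from node 0 to node 3 pays for the traffic of the two other connections because it shares both of its links with them.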


Given a means to calculate the communication cost of 
each image connection, an objective function can be defined 
to determine the overall cost of a mapping. The following 
functions are adaptations of two of the four objective func- 
tions investigated in [5]. 


Objective Functions: A simple optimization criterion used
in VLSI placement problems involves summing the costs
associated with pairs of components to obtain an overall
system cost. In the mapping problem, this corresponds to
summing the communication cost between every pair of
processes in the network. The total communication cost F1
can be written as

$$\mathrm{F1} = \sum_{j,k} c_{jk}.$$

Using this function with cost function C1 does give some
measure of the quality of an assignment, but it ignores the
conflicts due to link sharing by different image connections.
Thus F1 should be combined with cost function C2 to form
an objective function suitable for the mapping problem.


To more accurately describe the quantity being optimized
in multiprocess communication, a second objective
function F2 can be defined. When all processes in the
network are synchronized, the image connection with the largest
communication cost determines the overall performance. To
characterize this behavior, F2 is defined as

$$\mathrm{F2} = \max_{j,k} \left( c_{jk} \right).$$

F2 may be used with either cost function C1 or C2.
Minimizing either F1 or F2 does not necessarily minimize the
other, so the objective function used must be chosen with
care. The choice depends on the application under
consideration as well as the mapping algorithm used.
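
A small numerical example may help show why F1 and F2 can disagree. The Python sketch below computes both from a table of connection costs; the two cost tables are invented purely for illustration and the helper names `f1` and `f2` are assumptions.

```python
def f1(costs: dict) -> int:
    """Total communication cost: sum of all connection costs."""
    return sum(costs.values())


def f2(costs: dict) -> int:
    """Bottleneck cost: the largest single connection cost."""
    return max(costs.values())


# Two hypothetical mappings of the same three image connections.
mapping_a = {("p1", "p2"): 1, ("p2", "p3"): 1, ("p3", "p1"): 6}
mapping_b = {("p1", "p2"): 3, ("p2", "p3"): 3, ("p3", "p1"): 3}

print(f1(mapping_a), f2(mapping_a))
print(f1(mapping_b), f2(mapping_b))
```

Here the first mapping has the lower total cost but the second has the lower bottleneck, so a synchronized computation would prefer the second even though F1 alone ranks it worse.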


III. ASSIGNMENT AND SCHEDULING USING SIMULATED ANNEALING


For large scale mapping problems, obtaining an exact
optimal solution is not practical. Iterative improvement
algorithms have been employed in the mapping problem with
some success; however, they tend to produce solutions that
are locally but not globally optimal. The simulated annealing
method supplements iterative improvement by providing a
mechanism to escape local optima, and it has been found to
produce good solutions in combinatorial optimization
problems similar to the mapping problem [6]. Existing
mapping algorithms utilizing iterative improvement provide
a basis for a new approach to the mapping problem that uses
simulated annealing.


A. Partitioning the Problem for Simulated Annealing 


In the mapping problem, both the assignment of proc- 
esses to the host network and the scheduling of communi- 
cation paths are critical to the overall system performance. 
To optimize the mapping of the image graph to the host 
graph, a two phase mapping strategy is required. The first 
phase is essentially a placement problem; it attempts to de- 
termine the best mapping of processes onto nodes without 
considering the details of traffic routing. The second phase 
is analogous to the wiring problem; it optimizes the decom- 
position of traffic connections onto network links, operating 
within the constraints imposed by the routing rules of the 
network. Both optimization phases may be implemented us- 
ing simulated annealing. 


The design of a good simulated annealing algorithm re- 
quires the specification of four elements: system configura- 
tion, annealing schedule, move set, and objective function 
[6]. By varying the elements, a single annealing algorithm 
can be extended to work with both optimization phases. In 
the following sections, we concentrate on the aspects of sim- 
ulated annealing unique to the mapping problem. A general 
objective function suitable for annealing is formulated, and 
algorithm modifications specific to the assignment and 
scheduling phases are described. 


B. Objective Function for Annealing 


For effective annealing, the objective function should
exhibit a wide range of values corresponding to the factors
being optimized. Optimal configurations should have minimum
cost, and inferior or physically unrealizable configurations
should be penalized by high costs. Of the two objective
functions previously defined, F1 produces a greater variation
in cost, making it more desirable as a cost metric for
simulated annealing. However, objective function F2 more
accurately characterizes the quantity being optimized in the
mapping problem. To satisfy these conflicting requirements,
both F1 and F2 should be considered. Examining the form
of F1 and F2 shows that there is negligible overhead incurred
by keeping track of F2 as F1 is being calculated for a
configuration. An assignment that minimizes F1 but increases
F2 is not desirable, as F2 describes the actual limiting factor
in synchronous multiprocess communication.


A new objective function is formulated to provide a single
evaluation criterion by including both F1 and F2 as terms.
Introducing a constant weight factor $W$, one possibility for
such a function is

$$F = \mathrm{F1} + W \cdot \mathrm{F2},$$

where $W$ penalizes any configuration that increases F2. The
magnitude of $W$ should be large enough so that the minimum
variation in $W \cdot \mathrm{F2}$ for a single move is greater
than the maximum variation of F1. This ensures that a move
increasing (decreasing) F2 will produce an increase (decrease)
in the overall cost of a configuration. To achieve a similar
effect, we consider both F1 and F2 during annealing by: 1)
ignoring F2 during high temperature annealing, when temporary
increases in F2 and the objective function are to be expected,
and 2) rejecting all moves generated during low temperature
annealing that increase F2. This objective function, used for
annealing and defined as F1 combined with F2, will be referred
to as the standard objective function.
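
The two-regime rule described above can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation: the Metropolis acceptance test on the change in F1, the threshold parameter `low_temperature` that separates the two regimes, and the function name `accept_move` are all assumptions introduced here.

```python
import math
import random


def accept_move(delta_f1: float, delta_f2: float, temperature: float,
                low_temperature: float, rng: random.Random) -> bool:
    """Decide whether to accept a proposed move.

    Assumptions: Metropolis acceptance on the change in F1, and a
    hypothetical threshold `low_temperature` separating the regimes.
    """
    # Low temperature regime: reject any move that increases F2.
    if temperature <= low_temperature and delta_f2 > 0:
        return False
    # Otherwise F2 is ignored and the change in F1 drives the decision.
    if delta_f1 <= 0:
        return True
    return rng.random() < math.exp(-delta_f1 / temperature)


rng = random.Random(0)
hot = accept_move(-1.0, 5.0, temperature=10.0, low_temperature=1.0, rng=rng)
cold = accept_move(-1.0, 5.0, temperature=0.5, low_temperature=1.0, rng=rng)
print(hot, cold)
```

The same move, which lowers F1 but raises F2, is accepted while the system is hot and rejected once the temperature falls below the threshold.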


C. Processor Assignment 


The first mapping phase assigns image processes to host 
network processing nodes. The basic strategy of this phase
is to assign processes with large mutual communication
requirements to neighboring nodes in the host network. This
phase does not consider detailed traffic routing. However, 
the spatial locality and communication requirements of 
processes must be considered simultaneously to generate an 
optimal mapping. Mapping phase one will be referred to as 
process annealing, and is characterized as follows: 


1. Move Set: Moves are generated by pairwise exchanges 
of processes, Monte Carlo style. To evaluate the effect 
of a move on the objective function, the only process 
connections that need to be considered are those associ- 
ated with the two swapped processes. 


2. Objective Function: Since traffic routing is not considered
during process assignment, cost function C1 must
be used. Thus the cost of a connection is the product of
the communication intensity and the distance between
the nodes hosting the processes. The overall assignment
cost is determined by C1 used with the standard objective
function.


For process assignment, the spatial locations of the nodes to
which processes will be assigned are fixed by the physical
structure of the host network. The only possible move is the
pairwise exchange of two processes. However, one of the
processes may be a dummy process inserted in the host
network to account for excess processing nodes. This form of
move corresponds to a process translation. For the special
case $N > P$, this move gives the mapping algorithm an
additional degree of freedom, enabling it to move processes
among surplus nodes to determine the best distribution of
processes. In the final stages of process annealing, the
exchange of distant processes is unlikely to result in an
improvement in the objective function, and only processes
separated by small distances are considered for exchange.
Limiting the range of attempted moves in this fashion
maximizes the number of feasible moves attempted at each
temperature stage.
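
The observation that only the connections of the two swapped processes need to be re-evaluated can be illustrated with a short Python sketch. It computes the change in F1 under cost function C1 for a pairwise exchange on a hypercube host. The names, the weight dictionary layout, and the sample data are assumptions; a practical version would index edges by process instead of scanning the whole edge list as this short sketch does.

```python
def hypercube_distance(a: int, b: int) -> int:
    """Hamming distance between node labels."""
    return bin(a ^ b).count("1")


def total_c1(weights: dict, mapping: list) -> int:
    """F1 with cost function C1, computed from scratch."""
    return sum(w * hypercube_distance(mapping[j], mapping[k])
               for (j, k), w in weights.items())


def swap_delta(weights: dict, mapping: list, a: int, b: int) -> int:
    """Change in F1 (with C1) if processes a and b exchange nodes.

    Only image edges touching a or b contribute to the change.
    """
    def node_after(p: int) -> int:
        if p == a:
            return mapping[b]
        if p == b:
            return mapping[a]
        return mapping[p]

    delta = 0
    for (j, k), w in weights.items():
        if j in (a, b) or k in (a, b):
            before = hypercube_distance(mapping[j], mapping[k])
            after = hypercube_distance(node_after(j), node_after(k))
            delta += w * (after - before)
    return delta


ring = {(0, 1): 3, (1, 2): 1, (2, 3): 3, (3, 0): 1}
poor = [0, 3, 1, 2]
delta = swap_delta(ring, poor, 1, 2)
print(delta)
```

Exchanging processes 1 and 2 in this example lowers the cost, and the incremental result agrees with recomputing the full sum before and after the swap.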


D. Link Assignment 


The second phase of mapping, referred to as connection 
annealing, schedules the interprocess communication onto a 
physical network topology. Given the fixed process mapping 
generated by process annealing, connection annealing deter- 
mines an optimal assignment of image connections to host 
network data paths. This phase routes the connection be- 
tween every pair of communicating processes onto a path of 
one or more network data links between the source and des- 
tination nodes. When communicating processes are assigned 
to directly connected nodes, the corresponding data path for 
the connection will consist of a single link. Otherwise, the 
connection must be routed along a series of data links. In 
general, a connection should be routed along the least possi- 
ble number of links to minimize network path delay. How- 
ever, an indirect route may be necessary to avoid heavily 
utilized data links if adding traffic to that link would exceed 
its capacity. 


Unless all processes are assigned to directly connected 
nodes, corresponding to a perfect mapping, the possibility 
exists for link sharing among image connections. To avoid 
communication delays caused by the resulting link con- 
tention, the link assignment phase should route traffic along 
links supporting minimum traffic intensity whenever possible. 
By evenly distributing the traffic load among network links, 
the total network throughput can be maximized. Thus con- 
nection annealing must consider both path length and traffic 
intensity to determine an optimal link assignment. The
annealing algorithm for link assignment incorporates the
following elements:


1. Move Set: Path moves are more difficult to generate 
than the simple random pairwise exchanges in process 
annealing. A path is formed by starting at the source
node, and then choosing links according to criteria based 
on both traffic intensity and remaining path distance. 
The link may be selected by fixed or adaptive means, 
depending on network routing rules. To evaluate the ef- 
fect of a move on the communication cost of an assign- 
ment, the only data links that need to be considered are 
those affected by the connection being altered. 


2. Objective Function: To characterize the interaction and 
contention between image connections that arise during 
traffic routing, cost function C2 must be used. Thus the 
cost of a connection is the sum of the traffic intensities 
supported by the network links assigned to the con- 
nection. The overall assignment cost is determined by 
C2 used with the standard objective function. 


In process annealing the method used to generate moves 
is basically limited to the pairwise exchange of processes. 
For connection annealing, however, several possibilities exist 
for move generation. The method selected to rearrange a 
path depends on the routing flexibility allowed by the host 
network and the objective function used for scheduling. 
Given a source and destination node in the network and a
criterion for selecting links, the path generator must establish
a path if none is present, or produce a permutation of an 
existing path. 


The scheme used by the host network for traffic sched- 
uling limits the routing strategy used for link assignment. 
When the assignment of specific data links to a network path
is predetermined by the routing rules of the network, path
generation is limited to fixed path selection. If network
routing rules allow for several possible paths between source 
and destination nodes, link assignments may be based on an 
adaptive selection criterion. 


During path generation, the adaptive assignment ap- 
proach considers the existing traffic conditions produced by 
previously routed image connections. Starting with the 
source node, there may be several feasible choices for an
initial path link that reduce the remaining path distance. The
adaptive criterion specifies that the link $l_{jk}$ supporting
the minimum traffic $t_{jk}$ should be selected. If multiple
links support the same minimum $t_{jk}$, then the next path
link is selected from them randomly. This process is repeated,
selecting
minimum cost links until the path is completed. Due to its 
greedy nature, adaptive selection will not always produce a 
least cost path. In addition, there is no guarantee that a re- 
arranged path will actually be different from the original 
path. Adaptive path selection is more efficient than random 
path generation, and is useful for both the initial link assign- 
ment and the connection annealing optimization phase. 
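
The adaptive selection rule can be sketched for a binary hypercube as follows. In this Python illustration the candidate links at each step are those that flip one of the address bits in which the current node and the destination differ, so every candidate reduces the remaining distance by one. The function names, the `traffic` dictionary keyed by link, and the use of a seeded random generator for tie-breaking are assumptions made for the example.

```python
import random
from collections import defaultdict


def link_key(a: int, b: int) -> tuple:
    """Bidirectional link identifier."""
    return (a, b) if a < b else (b, a)


def adaptive_path(src: int, dst: int, traffic: dict, rng: random.Random) -> list:
    """Greedy minimum-length path from src to dst on a binary hypercube.

    At each node the distance-reducing link with the least traffic is
    chosen; ties are broken randomly.
    """
    path = [src]
    node = src
    while node != dst:
        diff = node ^ dst
        candidates = [node ^ (1 << bit)
                      for bit in range(diff.bit_length())
                      if diff & (1 << bit)]
        least = min(traffic[link_key(node, nxt)] for nxt in candidates)
        best = [nxt for nxt in candidates
                if traffic[link_key(node, nxt)] == least]
        node = rng.choice(best)
        path.append(node)
    return path


def route(src: int, dst: int, weight: int, traffic: dict,
          rng: random.Random) -> list:
    """Route one connection and add its weight to every link it uses."""
    path = adaptive_path(src, dst, traffic, rng)
    for a, b in zip(path, path[1:]):
        traffic[link_key(a, b)] += weight
    return path


traffic = defaultdict(int)
rng = random.Random(1)
first = route(0, 3, 1, traffic, rng)    # picks one of the two idle first links
second = route(0, 3, 1, traffic, rng)   # steered onto the other first link
print(first, second)
```

Because the choice is made one link at a time, the sketch shares the greedy limitation noted above: it balances the load it can see locally and does not guarantee a least cost path.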


E. Initial Assignment


For process and connection annealing, assignment of the 
initial system configuration can have a profound effect on 
both required run time and final mapping quality. Instead 
of relying on a random initial configuration, a procedure can 
be used to achieve a good initial assignment, and annealing 
can begin at a lower starting temperature to reduce run time. 
Using an initial assignment algorithm produces a system 
configuration that contains partially ordered domains. If the 
initial mapping is prepared carefully, the structure of these 
partially ordered regions corresponds to those existing in an 
optimal system configuration. Thus the amount of annealing 
required to locate a globally optimal mapping is greatly re- 
duced. To preserve the advantages of a good initial mapping, 
the starting temperature must be chosen low enough to 
search the immediate state space without destroying the de- 
sirable features of the mapping. Starting too low, however, 
may cause the annealing algorithm to become trapped in an 
inferior local minimum. 


Our experience indicates that using an initial assignment 
algorithm to generate process assignments for large image 
architectures can produce suboptimal configurations that are 
difficult to escape by low temperature annealing. Conse- 
quently we do not utilize the initial assignment algorithm for 
process annealing, and rely instead upon a random initial 
assignment with a complete temperature annealing run. The 
additional computation time is justified by the higher quality 
of final solutions obtained. 


Unlike the case of initial process assignment, we found
that the quality of traffic configurations produced by an
initial link assignment algorithm coupled with a low
temperature connection annealing run was comparable to that
of configurations generated by a random initial assignment
and full annealing.
Connection assignments produced by such an algorithm are 
greatly superior to the unbalanced, chaotic network paths 
generated by a random initial connection assignment. The 
selection of the initial annealing temperature is crucial for an 
efficient interface between the initial link assignment and 
connection annealing algorithms. 


IV. APPLICATIONS USING HYPERCUBE 


In this section the performance of simulated annealing 
is demonstrated using two image architectures. The host 
network is implemented as a binary hypercube topology, a 
popular architecture for large scale multicomputers [7]. In 
the following discussion, $D$ is the dimension of the
hypercube, and $L$ is the number of communication links in
the hypercube, where $L = D \cdot 2^{D-1}$.


A. Hypercube as host architecture 


In a sizable hypercube network incorporating hundreds 
or thousands of processor nodes, node assignment and com- 
munication scheduling are difficult problems. Process as- 
signment is complicated by the ability to map a number of 
different, complex image topologies onto the hypercube. In 
addition, for every pair of communicating nodes separated
by $d_{jk}$ links, there are $d_{jk}!$ possible paths along
which a connection can be routed. Determining a suitable
mapping for
a large hypercube network exercises the full power of the 
simulated annealing assignment algorithms. 
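
The link and path counts quoted above can be checked on a small hypercube. The following Python sketch is only an illustration and is not part of the original mapping software; the D-bit integer node labels and the helper names are assumptions made here. It counts shortest paths: two nodes separated by d links differ in d address bits, and those bits can be corrected in any of d! orders.

```python
from itertools import permutations
from math import factorial


def hypercube_links(dim):
    """Return the undirected links of a dim-dimensional binary hypercube."""
    links = set()
    for node in range(2 ** dim):
        for bit in range(dim):
            neighbor = node ^ (1 << bit)  # flip one address bit
            links.add((min(node, neighbor), max(node, neighbor)))
    return links


def shortest_paths(src, dst, dim):
    """Enumerate shortest paths from src to dst, one per bit-flip order."""
    differing = [b for b in range(dim) if ((src ^ dst) >> b) & 1]
    paths = []
    for order in permutations(differing):
        node, path = src, [src]
        for bit in order:
            node ^= 1 << bit
            path.append(node)
        paths.append(tuple(path))
    return paths


D = 4
links = hypercube_links(D)
assert len(links) == D * 2 ** (D - 1)      # L = D * 2^(D-1)

paths = shortest_paths(0b0000, 0b0111, D)  # nodes separated by d = 3 links
assert len(set(paths)) == factorial(3)     # d! distinct shortest paths
```

The factorial growth in the number of candidate paths is what makes exhaustive routing impractical for large D.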


For a general image graph, the communication overhead
of the optimal mapping is unknown prior to assignment and 
scheduling. Due to the nondeterministic and heuristic nature 
of simulated annealing algorithms, there is no guarantee that 
the best mapping will be found. To accurately measure the 
performance of the mapping algorithm, an image graph with 
a known global minimum is used as a test case. The image 
graph chosen is similar in form to that of the hypercube host 
graph. 


B. Hypercube Traffic Problem 


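The hypercube image graph is denoted by

$$G_{HC} = \langle V, E, W \rangle,$$
$$(p_i, p_k) \in E \quad \forall\, p_i, p_k \in V \mid d(p_i, p_k) = 1,$$
$$|E| = D \cdot 2^{D},$$
$$w_{ik} = 1 \quad \forall\, (p_i, p_k) \in E.$$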


Here the distance function d() is defined as for the hypercube, 
but is used instead with process indices. The connection 
structure of the graph corresponds to the hypercube graph, 
with weight 1 on each edge. For every communication link
in the hypercube, there are two opposing directed edges in 
the hypercube image graph. 
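
As an illustration of this test case, the sketch below builds the directed edge set of the hypercube image graph and compares an ideal assignment with a random one. It is a minimal sketch under stated assumptions: processes and nodes are labelled by D-bit integers, and `total_path_length` is a hypothetical stand-in metric introduced here, not the cost function used in the paper.

```python
import random


def hamming(a, b):
    """Distance d() between two D-bit labels: the number of differing bits."""
    return bin(a ^ b).count("1")


def hypercube_image_graph(dim):
    """Directed edges (p_i, p_k) with d(p_i, p_k) = 1, each of weight 1."""
    n = 2 ** dim
    return {(i, k): 1 for i in range(n) for k in range(n) if hamming(i, k) == 1}


def total_path_length(edges, assignment):
    """Sum of weight * host distance over all image edges (hypothetical metric)."""
    return sum(w * hamming(assignment[i], assignment[k])
               for (i, k), w in edges.items())


D = 3
edges = hypercube_image_graph(D)
assert len(edges) == D * 2 ** D            # two directed edges per host link

identity = list(range(2 ** D))             # process p_i on node i: ideal mapping
scrambled = identity[:]
random.Random(0).shuffle(scrambled)        # a random initial assignment

print(total_path_length(edges, identity))   # every edge spans exactly one link
print(total_path_length(edges, scrambled))  # never below the ideal value
```

Because the optimum is known by construction, the gap between the two printed values gives a direct measure of how far an assignment is from ideal.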


The performance of the mapping algorithm is evaluated 
by mapping random permutations of the hypercube image 
graph onto the hypercube host. The optimal solution is 
known in this case, so the relative quality of assignments 
generated by the algorithm can be determined. In the ideal 
mapping, every network link supports two connections with 
weight 1, so

If the algorithm succeeds in finding the optimal assignment
for this graph, all pairs of communicating processes fall
on nearest neighbor connected nodes in the hypercube, and
no traffic scheduling is needed. Thus the first phase of
annealing, processor assignment, is critical for this image
architecture. For an optimal assignment, the actual communication
overhead equals the ideal communication overhead.


The mapping algorithm was run on hypercube traffic graphs
containing from 8 to 512 processes. The results are tabulated
in Table 1, and represent average values obtained using random
initial assignments. The total moves column gives the
total number of moves generated during all stages of
annealing. Figure 1 demonstrates the relationship between
computational effort and problem size.


Fig. 1. Total Generated Moves vs. Problem Size


By adjusting values for the initial temperature and rate
of annealing according to the problem size, we were able to
converge to optimal solutions consistently for N < 128.
The initial temperature is taken high enough so that the ratio 
of accepted moves to total proposed moves exceeds 0.9, en- 
suring that the majority of generated moves are accepted. 
Large problem sizes require a slower cooling rate to investi- 
gate a greater portion of the problem search space. This in- 
creases the probability of finding an optimum solution, at the 
expense of increased execution time. 
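
The rule for choosing the initial temperature can be sketched as follows. This is a minimal illustration, assuming a Metropolis acceptance test and a simple doubling search; the function names, the doubling factor, and the sample cost changes are all assumptions made here and are not taken from the paper.

```python
import math
import random


def acceptance_ratio(deltas, temperature, rng):
    """Fraction of trial moves accepted by the Metropolis rule at a temperature."""
    accepted = 0
    for delta in deltas:
        # Downhill moves are always accepted; uphill moves with prob. exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            accepted += 1
    return accepted / len(deltas)


def initial_temperature(deltas, target=0.9, start=1.0, seed=0):
    """Double the temperature until the acceptance ratio exceeds the target."""
    rng = random.Random(seed)
    temperature = start
    while acceptance_ratio(deltas, temperature, rng) <= target:
        temperature *= 2.0
    return temperature


# Hypothetical cost changes observed for a batch of random trial moves.
sample_rng = random.Random(1)
sample_deltas = [sample_rng.uniform(-5.0, 20.0) for _ in range(2000)]
t0 = initial_temperature(sample_deltas)
print(t0)
```

A larger problem would then pair this starting temperature with a slower cooling rate, trading execution time for a wider search.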


C. Tree Traffic Problem 


Tree image graphs are frequently encountered in parallel 
processing applications. The graph considered here has the 
form of a binary tree, described by 


$$G_{TREE} = \langle V, E, W \rangle,$$
$$|E| = 2 \cdot (N - 1),$$
$$w_{ik} \quad \forall\, p_i, p_k \in V \mid (p_i, p_k) \in G_{TREE}.$$


An extra process is added to the graph and connected to the 
root of a standard binary tree so that N is a power of 2. There 
is no known closed form solution for the minimum communication
overhead for a mapping of the tree graph onto a
hypercube, so the optimal values of F1 and F2 are unknown.


To map the tree graph onto a hypercube, both process 
and connection annealing were used. Each phase of 
annealing is effective in reducing the overall communication 
cost. In all cases a random initial configuration was used, 
and the parameters for the annealing schedule were chosen 
to provide a good balance between mapping optimality and 
algorithm computation requirements. Table 2 shows the
results for tree sizes containing 8 to 512 nodes. The tabulated
values for total moves reflect the sum of process moves gen- 
erated during process annealing, and path moves generated 
during connection annealing. Mapping results show that the 
hypercube network provides excellent support for the com- 
munication requirements of the tree image graph. 


V. CONCLUSIONS 


A graph-based scheme utilizing the simulated annealing 
optimization heuristic has been developed for the automated 
mapping of an arbitrary image graph onto a general-purpose 
multiprocessor architecture. The complete procedure em- 
ploys two annealing-based optimization phases. Process 
annealing attempts to assign processes that exhibit high mu- 
tual communication requirements to neighboring nodes in 
the host network. Connection annealing incorporates an in- 
itial assignment procedure, and further reduces communi- 
cation costs by performing traffic routing of data paths. A 
communication cost function is formulated that captures the 
effect of transmission delays and bottlenecks arising as proc- 
esses compete for communication resources. 


The simulated annealing technique is easily extended to 
generate mappings for a large class of host and image archi- 
tectures. The underlying annealing procedure is completely 
general, and makes no assumptions about the intercon- 
nection structure of the host or image architectures. De- 
pending on the application, varying parameters such as the 
class of moves generated, the cost functions, and the 
annealing schedule enable the behavior of the mapping algo- 
rithm to be modified for maximum performance. As cur- 
rently implemented, the procedure uses a binary hypercube 
topology as the host architecture. The mapping scheme has
been evaluated using a variety of image graphs. We were
able to anneal into optimal solutions for N < 128, and
near-optimal solutions for larger image architectures. Our results
show that the strategy scales well for large problem sizes,
obtaining good results with computational effort proportional
to small powers of N.


Abstract - Parallel processing of symbolic computations
on a message-passing multi-processor presents one
challenge: To effectively utilize the available processors,
the load must be distributed uniformly to all the processors.
However, the structure of these computations cannot
be predicted in advance. So, static scheduling
methods are not applicable. In this paper, we compare
the performance of two dynamic, distributed load
balancing methods for small-grained tasks on large
parallel machines.

1. Introduction 


Processor utilization is a key factor that decides the 
speedup provided by a parallel system. A thousand processor
system can provide a speedup of 1000 only if all
the processors can be kept busy all the time. Ideally, the 
computation should be divided in P equal parts (where P 
is the number of processors), one for each processor. 
But, it is usually impossible to identify ‘P equal parts’ 
except for highly structured computations. An alterna- 
tive is to divide the computation into many small 
granules. Then, even if these granules are of unequal 
sizes, their large number would allow us to distribute 
them equally. Many parallel evaluation schemes for 
functional programs, logic programs, problem-solving, 
searching etc., offer such a small grain of parallelism. 


The large pool of tasks may lead to an increased
speedup only if there is an effective load distribution
scheme, one that ensures that no processors remain idle
while there is work available in the system. This is particularly
true on a message-passing multiprocessor.


What sort of load-balancing system is needed for a
message passing system? The unpredictability of computation
structures implies that it must be a dynamic or
run-time strategy, as opposed to a static or compile-time
strategy. For scalability, it must not be centralized at a
few PEs, but distributed on all of them. Also, it should
not depend on global information. Each PE should only
use the information provided by its neighbors.


There has been a substantial amount of research on
the problem of load balancing and load distribution [1, 2,
8]. However, most of it has been in the context of either
large-grain tasks, or a relatively small number of processors,
or in the context of real-time tasks. Much work
has been done for static load balancing (see footnote 2), where the
task-to-processor mapping is decided ahead of run-time.
There has been very little work on dynamic load balancing
for fine-grained parallel tasks running on a large
number (hundreds to tens of thousands) of parallel processors.


In this paper, we compare the performance of two
such load balancing schemes. One of them is contracting
within a neighborhood (CWN), a relatively simple
strategy proposed by us [3]. The other is the Gradient
Model (GM) proposed by Lin and Keller [6].

2. The Competitors 


The small grain tasks found in most application
domains have some interesting features in common.
When activated, they execute for a short time, and then
either complete, or start some sub-tasks and await
responses from them. The same cycle is repeated on
receiving a response. Usually, it is prohibitively expensive
to move a task from one PE to another after it has
spawned sub-tasks. Both the strategies we describe
avoid that. They do differ as to when a task is distributed:
CWN schedules a task on some PE as soon as it is
created; the GM keeps the newly created tasks on the
source PE, and distributes them when required.


2.1. Contracting Within Neighborhood 


This scheme is based on the fact that allowing com- 
munication between arbitrary pairs of PEs is not scalable. 
In a system with global communication, as the number of 
PEs is increased, a point is reached beyond which the sys- 
tem is always communication bound. This is true for any 
interconnection scheme which uses a fixed number of 
connections per PE [4]. It is possible to avoid global com- 
munication in tree structured computations as the com- 
munication is almost exclusively between parent and 
child tasks. So CWN restricts a child task to be within a 
fixed radius — neighborhood — from its parent. Also, in 
the interest of agility, CWN sends every subgoal out to 
another PE as soon as it is created. 


2 In some of the large-grain load balancing literature, a distinction is 
made between the terms load balancing and load distribution. There the 
former term refers to initial distribution of work, whereas the latter refers to 
what we call redistribution of work. On fine-grained systems, the tasks are 
being created throughout the life cycle of a computation at an almost
equal rate. We use the term load balancing to refer to the general problem 
of maintaining adequate levels of load on all processors. 


Each PE maintains the load information about its
immediate neighbors. This information can be a combination
of various factors that gauge the current and
future ‘load’ on that PE. A simple measure is the
number of messages waiting to be processed by that PE.
This information is maintained by broadcasting a very
short message to all the neighbors periodically, or as an
optimization, piggy-backing the load information ‘word’
with regular messages. Any time a subgoal is created on
a PE, it sends a new goal message to its least loaded
neighbor. The message also includes a count field that
says how many hops the message has traveled from the
source. A PE that receives such a message keeps the
goal for processing if the hop count is equal to the
allowed radius. Otherwise it sends the goal to its least
loaded neighbor after incrementing the count. If a PE
finds its own load is less than that of its least loaded neighbor,
it keeps the goal provided the message has already traveled
a stipulated minimum number of hops. Thus, a new subgoal
travels along the steepest load gradient to a local
minimum. A goal, once it is accepted by a PE, remains
there, and is finally executed by that PE.
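
The forwarding rule just described can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the simulator's code: the function names, the dictionary-based neighbor tables, the six-PE ring, and the load values are all assumptions, and load is taken to be a single number per PE.

```python
def cwn_step(pe, hops, load, neighbors, radius, horizon):
    """Decide what a PE does with a goal message that has traveled `hops` hops.

    Returns (next_pe, next_hops, accepted).
    """
    if hops >= radius:
        return pe, hops, True                  # radius reached: keep the goal
    best = min(neighbors[pe], key=lambda n: load[n])
    if hops >= horizon and load[pe] < load[best]:
        return pe, hops, True                  # local minimum past the horizon
    return best, hops + 1, False               # forward down the load gradient


def place_goal(source, load, neighbors, radius, horizon):
    """Route a newly created goal from its source PE until some PE keeps it."""
    # The source always contracts the goal out to its least loaded neighbor.
    pe = min(neighbors[source], key=lambda n: load[n])
    hops, accepted = 1, False
    while not accepted:
        pe, hops, accepted = cwn_step(pe, hops, load, neighbors, radius, horizon)
    load[pe] += 1                              # the goal now waits at this PE
    return pe, hops


# Hypothetical example: a ring of six PEs with an idle PE at position 3.
ring = {i: [(i + 1) % 6, (i - 1) % 6] for i in range(6)}
loads = [5, 4, 3, 0, 3, 4]
print(place_goal(0, loads, ring, radius=4, horizon=1))
```

In this example the goal descends the load gradient from PE 0 and stops at the idle PE; with a smaller radius it would be kept earlier, at a more heavily loaded PE.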


As it follows the local load gradients, this scheme
may not send a given subgoal to the least loaded PE in
the neighborhood, because of the horizon effect. However,
looking for the least loaded PE in the neighborhood
would be expensive. The minimum hops are stipulated 
to alleviate this problem to some extent. A source PE 
cannot keep a piece of work even if it is the least loaded 
among its neighbors. It must send it some distance to 
‘look over the horizon’, and then possibly get it back. 


The scheme is naive on several counts. First, 
requiring every piece of work to be contracted out to 
another PE seems excessive. Also, once a goal reaches its 
‘destination’ it remains stuck there, which removes 
opportunities for a correction as time goes on. However, 
the strategy is meant as a starting point. The simula- 
tion studies should suggest specific ways of improve- 
ment. 


The scheme has two parameters: the radius, i.e. the 
maximum distance a goal message is allowed to travel, 
and the horizon, i.e. the minimum distance a goal mes- 
sage is required to travel. 

2.2. The Gradient Model 


The gradient model is a more elaborate scheme
than CWN. A newly generated subgoal is simply entered
in the local queue. A separate, asynchronous process
handles the load-balancing functions. This process
wakes up periodically, and computes the load on the PE
as in CWN. Using two parameters, the low-water-mark
and high-water-mark, it decides the state of the node as
follows. If the load is below the low-water-mark, the
state is idle. If the load is above the high-water-mark,
the state is abundant; otherwise, it is neutral. It then
computes its proximity: the proximity of an idle node
is 0. For others, it is one more than the smallest proximity
of their immediate neighbors. All the PEs initially
assume that the proximities of their neighbors are 0. If
the calculated proximity is more than the network diameter,
then it is set to (network diameter + 1), to avoid
unbounded increase in proximity values. If the proximity
is different from its previous value, it is broadcast to
all the neighbors. If the state is idle or neutral, the
process sleeps until the next interval. If the state is
abundant, it sends a goal message from the local queue
to the neighbor with least proximity. The neighbor just
adds the message to its queue. This may change its state,
which is noticed when the gradient process on that PE
wakes up.
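
One wake-up of the gradient process can be sketched as below. This is a simplified illustration, not the simulated implementation: the names are hypothetical, load is taken to be the local queue length, and a shared `prox` table stands in for the broadcast of proximity values, so neighbors see updates immediately.

```python
IDLE, NEUTRAL, ABUNDANT = "idle", "neutral", "abundant"


def node_state(load, low_water_mark, high_water_mark):
    """Classify a PE by comparing its load with the two water marks."""
    if load < low_water_mark:
        return IDLE
    if load > high_water_mark:
        return ABUNDANT
    return NEUTRAL


def proximity(state, neighbor_proximities, diameter):
    """0 for an idle PE; otherwise 1 + smallest neighbor value, capped."""
    if state == IDLE:
        return 0
    return min(1 + min(neighbor_proximities), diameter + 1)


def gradient_cycle(pe, queues, prox, neighbors, low, high, diameter):
    """Run one wake-up on `pe`; return True if its proximity changed."""
    state = node_state(len(queues[pe]), low, high)
    new_prox = proximity(state, [prox[n] for n in neighbors[pe]], diameter)
    changed = new_prox != prox[pe]             # a change would be broadcast
    prox[pe] = new_prox
    if state == ABUNDANT:
        target = min(neighbors[pe], key=lambda n: prox[n])
        queues[target].append(queues[pe].pop(0))  # send one goal downhill
    return changed


# Hypothetical example: three PEs in a line, with all work initially on PE 0.
line = {0: [1], 1: [0, 2], 2: [1]}
queues = {0: ["g1", "g2", "g3", "g4"], 1: ["g5"], 2: []}
prox = {0: 0, 1: 0, 2: 0}                      # initial assumption: all 0
for pe in (2, 1, 0):
    gradient_cycle(pe, queues, prox, line, low=1, high=2, diameter=2)
print(prox, queues)
```

Note that a goal moves only one hop per wake-up of an abundant PE, which is the source of the slower spread of work discussed in the results.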


The proximity of a PE represents a guess at the
shortest distance to an idle PE. It is a ‘guess’ because by
the time the information about an idle PE reaches
another PE via the update-and-broadcast-proximity
sequence, the state of some PEs may have changed.


The rationale behind the GM is to keep work locally
as far as possible, and to send work out towards a PE
that is in danger of being idle. This strategy is
parameterized by the low-water-mark, the high-water-mark,
and the sleeping interval between two execution
cycles of the gradient process.

3. The simulation set-up


The simulations were carried out on ORACLE, a 
multi-processor simulation system we are developing. 
ORACLE is written in SIMSCRIPT, which supports pro- 
cess abstraction. ORACLE has one process for each user 
process running on a PE, and one process for each com- 
munication channel. Thus it models contention for the 
basic resources of a parallel system. 


ORACLE accepts input specifications such as the 
number of PEs and their interconnection scheme, the 
load balancing strategy to be used (from its repertoire of 
strategies), control strategy options, form and kind of 
output information required, a program to execute and 
times to be charged for primitive operations. ORACLE 
can provide statistics on a variety of performance 
aspects such as the overall average PE utilization, aver- 
age utilization of individual PEs, average and individual 
utilizations of communication channels, and the time to 
completion. 


A point worth noting is that when we run a program
on ORACLE, we get the result of the program, in
addition to the performance statistics. In contrast, a
trace driven simulation approach would be to carry out
the computation in advance, producing a trace, which
will then be used by the simulation system to get the
performance figures. We found such an approach would
not save much in terms of simulation time. Another
approach could be to use a statistical model of computation.
In the absence of any uniform model of parallel computations,
it was thought to be too unreliable and ad hoc
an option. So we opted for executing specific computations
with well-understood structures.


The sample points at which to compare the two 
schemes vary on many dimensions: the interconnection 
topologies, the number of PEs, the computation struc- 
ture and size, and the communication to computation 
ratio. 


We selected 2 interconnection topologies: the 2-dimensional
grid (nearest neighbor grid) with wrap-around
connections and the double-lattice-mesh (DLM)
topologies. The grid was used in simulations of the gradient
model by Lin [7]. The DLM is a bus-based topology
proposed by us [4]. We also decided to simulate
systems with 25 to 400 PEs. Beyond 400 PEs, the time 
required for simulations was prohibitive. This range 
should be sufficient to understand how the schemes will 
behave when the size of the system changes. 


To be able to interpret the simulation results, and
get an understanding of how the load balancing schemes
behave, we needed a predictable computation, whose
structure is easy to grasp. Then, there won’t be ambiguities
about whether a certain feature that is seen in
the simulation data is due to the nature of the computation
or due to the load-balancing scheme. We chose to
use divide-and-conquer and naive-fibonacci programs
for these reasons. The divide-and-conquer (abbreviated
dc) program was used by Lin, and may be written as:
    dc(M,N) <- if M = N then M
               else dc(M,(M+N)/2) + dc(1 + (M+N)/2, N)
The naive-fibonacci is the doubly recursive function to
compute fibonacci numbers.
    fib(M) <- if M < 2 then M else fib(M-1) + fib(M-2)

It must be pointed out that we are not really interested 
in how to compute these functions in parallel. There are
much more efficient methods for computing them. 


We used 6 different computation sizes for each program:
Fibonacci of 7, 9, 11, 13, 15 and 18, and the dc
computations of the same sizes, namely dc(1,n) for
n = 21, 55, 144, 377, 987 and 4181. As we wanted to
focus on effectiveness of load distribution, we decided to 
isolate the factor of communication load. We chose the 
ratio of communication to computation to be such that 
communication stagnation does not occur. 
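
The statement that the dc and Fibonacci computations have the same sizes can be checked directly. The Python sketch below is an illustration only, assuming that each function invocation corresponds to one goal and that the division in dc is integer division; the counter argument and the `goals` helper are hypothetical additions for counting.

```python
def dc(m, n, counter):
    """Divide-and-conquer sum of m..n; counter[0] tallies invocations."""
    counter[0] += 1
    if m == n:
        return m
    mid = (m + n) // 2
    return dc(m, mid, counter) + dc(mid + 1, n, counter)


def fib(m, counter):
    """Doubly recursive Fibonacci; counter[0] tallies invocations."""
    counter[0] += 1
    if m < 2:
        return m
    return fib(m - 1, counter) + fib(m - 2, counter)


def goals(func, *args):
    """Run func and return (result, number of invocations)."""
    counter = [0]
    value = func(*args, counter)
    return value, counter[0]


for fib_arg, dc_size in zip([7, 9, 11, 13, 15, 18],
                            [21, 55, 144, 377, 987, 4181]):
    _, fib_goals = goals(fib, fib_arg)
    _, dc_goals = goals(dc, 1, dc_size)
    # Both call trees have dc_size leaves, hence 2 * dc_size - 1 nodes.
    assert fib_goals == dc_goals == 2 * dc_size - 1
```

The two programs therefore generate the same number of goals at each size but differ in tree shape: the dc tree is balanced, while the Fibonacci tree is skewed.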
3.1. The optimization experiments 


Each scheme has a few parameters that have to be
selected. In the interest of fairness, the parameters
must be chosen in such a way that each scheme is working at
its best. We chose a few sample points in the space of
planned experiments, and ran the simulations for various
combinations of parameters. The winning combinations
were used for the comparison experiments. The
parameters so chosen are shown in the table below.

It is worth noting that the 20 units interval is fairly
low, as the total execution time for simulations ranged
from 1000 to 23000 units. That means the gradient process
is running very frequently, which should be an asset
to its performance. Also, we assume a communication
co-processor to handle the routing and load-balancing
functions (for both strategies). Without such a co-processor,
the gradient model will suffer more, because
it executes more complex code, and does so more frequently.


Table 1: Selected Parameters 
4. Simulation Results, and Interpretation 


The choices of sample points mentioned above lead
to 240 simulation runs (2 problem types * 6 problem
sizes * 2 topology types * 5 topology sizes * 2 strategies).
The simulations were run on a VAX-750. Each run
took between 15 minutes and 3 hours.


Plots 1 through 6 show the performance of the two
schemes on the divide-and-conquer computations. (See
[5] for the complete set of plots, including simulations
for hypercubes.) Each plot depicts experiments done on
a specific topology, for one problem type. Thus Plot 1
shows the results of 6 dc computations of varying sizes,
running on a double-lattice-mesh with 400 (20x20) PEs.
The Y-axis shows the average PE utilization in percent.
The X-axis is the problem size in total number of goals
generated during the computation. The speedup can be
computed by multiplying the number of PEs by (average
utilization percentage/100).


On the grid topologies, the CWN is a clear winner
by substantial margins. On the double-lattice-meshes
also CWN consistently performs better than the GM.
The only case seen in these plots where CWN is outperformed
by the GM occurs in Plot 2, while running
dc(1,4181) on a DLM with 100 PEs.


The comparative figures from all the runs are
shown in Table 2. For each run, we show the ratio of
the speed-up obtained using CWN to that obtained using
GM. In 118 out of 120 cases, the CWN is seen to be
better. In 110 of those cases, the difference is
significant, i.e. more than 10%. On grids, the CWN at times
leads to three times the speed-up of GM (i.e. one third of
the response time).


The DLM topologies have smaller diameters (4-5) 
compared to the grids (ranges from 8 to 38). The supe- 
rior performance of CWN on the grids leads us to conjec- 
ture that it performs better than the GM on large sys- 
tems, which of course tend to have larger diameters. 


To understand the operation of each method, we
plot the utilizations during short sampling intervals
throughout the course of computation, for a few selected
computations. Plots 7 through 9 show the utilization as
time varies for 3 Fibonacci computations on both topologies
with 100 PEs. The CWN has a much faster ‘rise-time’
than GM: it spreads work quickly to all the PEs at
the beginning. The pitfalls of CWN are also seen, e.g. in
Plots 7 and 8. Although it takes the system close to
100% utilization quickly, it cannot maintain the performance
at that level. The Gradient model manages to
maintain 100% when it reaches that level (Plot 7). This
is because of GM’s ability to re-distribute work. For
CWN, once a goal is sent to a PE, it must be executed
there, although the load conditions may change after
that. It can correct such imbalances only by using
newly created goals, which limits its ability to supply
work to idle processors.


The main problem with GM is that it is not agile 
enough. PEs hoard work until they are sure they are 
‘abundant’. On the grids, a stronger flattening is seen 
(plot 9). When about 40% of the PEs have received 
work, most PEs think there is not sufficient work to dis- 
tribute it to others, and so keep the new goals they gen- 
erate, which leads to loss of parallelism, and as a result 
not enough work gets generated. This ‘vicious cycle’ is 
responsible for the flattening of the plot. 


Examination of the detailed simulation output, not 
shown here, reveals another potential problem with 
CWN. Typically, it requires thrice as much communica- 
tion as the GM. In GM, the average distance traveled by 
a goal message is typically less than 1. A significant 
number of goals just stay at the PE they were created 
on. On the grids, with CWN the distance traveled is 
about 3. For example, in computing fib(18) on a 10x10 
grid, the average distance was 3.15 for CWN and 0.92 for 
GM. 

5. Conclusions and Future Work 


Although CWN performs better than GM in most
experiments reported here, it still has considerable room for
improvement. First, CWN does not allow a goal to be
re-distributed once it has been sent to another PE. As
seen in Plots 7 and 8, the available work is just
sufficient to keep every PE busy, but as the CWN cannot
re-shuffle work, some PEs remain idle. However,
re-shuffling is not useful when the work is more than
sufficient or when it is too little. So, a small, well-controlled
(i.e. responsive to run-time conditions) re-distribution
component should be added to CWN. Also,
the larger communication distances indicate that CWN
needs saturation control: when the system is running at
100% utilization, there is no need to send every goal out
to other PEs. Detecting such a situation and then keeping
goals locally until the situation changes would be
worth investigating. Both of these amount to incorporating
the good features of GM in CWN. Care must be
taken not to lose the agility of CWN while modifying it.


A note of caution is in order. We chose a low com- 
munication to computation ratio to ensure that com- 
munication stagnation does not interfere with the pro- 
perty we were trying to measure: namely, the ability to 
distribute computation load effectively. When the ratio 
is higher, CWN, as it is, may lose some of its edge. 
Techniques of the last paragraph will then be necessary. 


Acknowledgement: I am grateful to Michael Carroll, 
Valerie Rasmussen, and Wennie Shu for their help with 
the simulations.


University of Pennsylvania, Philadelphia, PA 19104


Abstract — Resource migration in a distributed com- 
puter system can be performed for performance enhance- 
ment as well as for reliability or availability improvement. 
The intractability of the general load balancing model with 
both job and resource migration suggests obtaining ap- 
proximate solutions. The existing approach is to use heuris- 
tic rules to find approximate solutions. In this paper, 
we adopt an alternative approach of separating the job 
and resource migration problems and propose an approx- 
imate model (commodity distribution) for resource mi- 
gration which can be solved by a polynomial-time algo- 
rithm. We demonstrate the application of this model to 
two load balancing problems: file migration in distributed 
databases and host migration in mobile computer net- 
works. We also outline our efficient algorithm for solving 
a special case of this model. 


Introduction 


A resource in a computer system is defined as any hardware
or software entity required for the execution of a
user job. Examples of resources include processors, memories,
interconnection networks, system processes, data files,
database relations and file servers. In a distributed computer
system, some of these resources are distributed among
the various nodes in the system. If the distribution of a
resource among the nodes can vary with time, then we call
this resource a ‘migratable’ resource. Examples of such resources
include data files, processes and mobile hosts. Traditionally,
the term ‘load balancing’ refers to the operation
of distributing or redistributing the user tasks among the
different nodes in a distributed system, to achieve a desirable
performance level; typical performance measures
include job response time, throughput and processor utilization.
We extend this definition of load balancing to
include the operation of distributing or redistributing the
migratable resources of a computer system to achieve a desirable
performance level. Redistribution of the user jobs
among the nodes in the system is known as job migration.
We call the redistribution of migratable resources
resource migration.

Load balancing models without resource migration have
been extensively studied in the literature (e.g. [3], [5]).
We give a few examples of applications where resource migration
is also used for load balancing. In a distributed
database system, file migration is performed in order to
maintain, at all times, a desirable relation between the file
access rates and the distribution of file copies among the
nodes. Another example of resource migration in a distributed
computer system can occur when a job in one host
needs the services of a system process, such as a file server,
query processing program or editor process, running on
a remote host. Here, instead of sending the request to the
remote host and transferring the results back, the required
process itself can be migrated from the remote host. An
example in which a processor itself can migrate is a mobile
computer network which consists of mobile hosts and
in which the topology can change from time to time.

An important issue in load balancing with both job and
resource migration is the problem of deciding which jobs
or resource units to migrate. We refer to this problem as the
general load balancing problem with resource migration.
In this problem, it is necessary to find a proper distribution
of resources and jobs among the various nodes of the
system so that the desired trade-off occurs between the
job migration cost and the resource migration cost. In one
formulation of this problem, the total migration cost (of
jobs and resources) is minimized. This optimization problem
has been shown to be NP-hard ([2]). The total cost
criterion is useful when the resources and the jobs have to
be migrated one at a time as, for example, when a single
broadcast bus such as Ethernet is used for migrating the
resources and the jobs among the nodes. The bottleneck
cost criterion is more appropriate when the resources and
the jobs can be migrated in parallel. The load balancing
problem with the bottleneck cost criterion can also be
shown to be NP-hard. We omit the problem formulation
and the proof of its intractability here (see [7]).

The exact solutions for the load balancing problem
need either exhaustive or heuristic search procedures, all
of which are prohibitively expensive to execute in real
time. As a result, approximate solutions are usually used
instead for the problem. One approach which is common
to all the existing techniques is to use heuristic rules for
guiding the search to an approximate solution. However,
the heuristic rules for obtaining approximate solutions are
generally difficult to derive and the accuracy of the solutions
is hard to verify. In this paper, we propose a new
approximation approach to solve the general load balancing
problem.

In the new approach, first we focus on resource migration
only but with a view to reducing the job migration
costs. For this resource migration problem, we propose 
an approximate model which can be solved in polynomial 
time. This approximate model partitions the given system 
into regions such that all the jobs as well as the resources 
in a region have similar characteristics. The partitioning 
helps to achieve local approximations for the job and re- 
source characteristics (the smaller the regions, the better 
the approximations) as well as to separate job and resource 
migration problems. By suitable partitioning, good ap- 
proximate solutions to resource migration problem can be 
obtained. The approximate model for resource migration 
reduces to a bottleneck transportation problem and hence 
can be solved by a polynomial-time algorithm. 

In the next section, we introduce our approximate model
for load balancing and also give a brief outline of an
efficient algorithm we have developed for solving a special
case of this problem. In Section 3, we discuss in detail the 
application of this model to file migration problem in dis- 
tributed databases and the simulation results on two small 
examples. In Section 4, we discuss briefly the application 
of the model to host migration in mobile computer net- 
works. Finally, we give conclusions and future directions. 


Approximate Model for Resource Migration 


First, we partition the given distributed system into a certain
number of regions (say $m$) $W_1, W_2, \ldots, W_m$ such that
all the jobs within a region have similar characteristics, such
as resource requirements. In our model, we only consider 
the migration of the resources but not the jobs among the 
regions. It may be possible that even after resource migra- 
tion, a job at a node may need a resource which may not 
be available at the same node. In this case, the request 
for the resource can be processed remotely at some node 
within the same region or the job itself can be migrated 
to the node containing the resource. In either case, after 
resource migration, all the resource requirements within 
a region must be met by the resources that exist in the 
same region; this restriction is specified as a constraint in 
our problem. 

A word on notation. We denote the set of non-negative
integers by $\mathbb{N}$. Now we define the following parameters:


$f_r$: fixed cost of migrating one unit of resource
$c_r$: unit distance resource migration cost
$b_j$: average resource requirements for a job in region $W_j$
$M_j$: number of jobs in region $W_j$ that need the resource
$N_j$: number of resource units in region $W_j$ before migration
$h_j$: average time for a node in region $W_j$ to communicate
a resource request to a remote node within $W_j$ and send
the results back
$a_j$: average time to process a resource request in region $W_j$
$d_{jk}$: average distance between regions $W_j$ and $W_k$
$T$: desired average job response time


The quantities $a_j$ and $b_j$ are averages over all the nodes
and the jobs respectively, while $h_j$ is an average over all
pairs of nodes in the region $W_j$. The quantity $d_{jk}$ is an
average over all pairs of nodes $(v_1, v_2)$ such that $v_1 \in W_j$ and
$v_2 \in W_k$. We assume that the total number of resource
units remains the same after resource migration.

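With the bottleneck migration cost criterion, the resource
migration problem is formulated as follows:

\[
\begin{aligned}
\text{Minimize} \quad & \max_{\{(j,k) \mid z_{jk} > 0\}} (f_r + c_r d_{jk}) \\
\text{s.t.} \quad & h_j + \frac{a_j b_j M_j}{\sum_{k=1}^{m} z_{kj}} \le T, && \text{for all } 1 \le j \le m \\
& \sum_{k=1}^{m} z_{jk} = N_j, && \text{for all } 1 \le j \le m \\
& z_{jk} \in \mathbb{N}, && \text{for all } 1 \le j, k \le m.
\end{aligned}
\]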


The variable $z_{jk}$ denotes the number of resource units that
need to migrate from the region $W_j$ to the region $W_k$. The
quantity $R_j = a_j b_j M_j / (T - h_j)$ is the minimum resource capacity
(expressed in terms of resource units) needed to meet
the requirements of jobs in region $W_j$. The quantity $t_{jk} =
(f_r + c_r d_{jk})$ is the cost of migrating a resource unit from
the region $W_j$ to the region $W_k$.
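With this notation, we reformulate the load balancing
problem as follows:

\[
\begin{aligned}
\text{Minimize} \quad & \max_{\{(j,k) \mid z_{jk} > 0\}} t_{jk} \\
\text{s.t.} \quad & \sum_{k=1}^{m} z_{kj} \ge R_j, && \text{for all } 1 \le j \le m \\
& \sum_{k=1}^{m} z_{jk} = N_j, && \text{for all } 1 \le j \le m \\
& z_{jk} \in \mathbb{N}, && \text{for all } 1 \le j, k \le m.
\end{aligned}
\]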


The above formulation is known as the bottleneck trans- 
portation problem in the Operations Research literature. 
It is also possible to define two sets of regions, one for 
the resource distribution before migration and the other 
for the distribution after migration. We call these regions 
“supply” regions and “destination” regions respectively. 
The supply regions can be defined on the basis
of job characteristics and also, to a certain extent, on the
network characteristics. The destination regions can be 
defined on the basis of resource migration costs. Note that 
the number of supply regions need not be the same as the 
number of destination regions. This variation in the model 
provides a close approximation to the exact model we have 
defined and also it can be used to reduce the dimension of 
the problem with very little additional approximation. 
Several efficient algorithms for the bottleneck trans- 
portation problems have been proposed in the literature 
(e.g. [4]). One special case of the problem is the $2 \times n$
($n \times 2$) bottleneck transportation problem, in which there
are two suppliers (destinations) and $n$ destinations
(suppliers). We have developed an $O(n^2)$ algorithm to solve
this special case. We give a brief outline of our algorithm
here (refer to [6] for details). The algorithm for the
$2 \times n$ problem uses the fact that there is an optimal solution
in which at most one destination needs to be supplied
by both the suppliers. Hence we restrict our attention to
only those solutions which satisfy this property. At each 
iteration, a new feasible solution is obtained which has a 
bottleneck value less than or equal to that of the previous 
solution. An upper bound for the optimal value is implic- 
itly defined by each new feasible solution. In addition, we 
determine at each iteration, the optimal shipments and 
the corresponding suppliers for one or more destinations. 
Hence after each iteration, a new bottleneck problem is 
solved with these destinations eliminated. The bottleneck 
value among the eliminated destinations is used to define 
a lower bound for the optimal value. The algorithm stops 
when the lower bound is greater than or equal to the upper 
bound or when the optimal shipments can be determined 
for all the destinations. 
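
The outline above leaves the details of the special-purpose algorithm to [6]. As an independent illustration, and not the authors' procedure, the following Python sketch solves a small bottleneck transportation instance with a standard threshold search: it binary-searches the sorted arc costs and, for each candidate threshold, checks feasibility with a maximum-flow computation. The function names, the dense-matrix flow routine, and the assumption that the requirements have already been rounded up to integers are ours.

```python
from collections import deque


def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow on a dense integer capacity matrix."""
    n = len(cap)
    res = [row[:] for row in cap]            # residual capacities
    total = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:     # BFS for a shortest augmenting path
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and res[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return total
        push, v = float("inf"), t
        while v != s:                        # smallest residual capacity on the path
            push = min(push, res[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:                        # augment along the path
            u = parent[v]
            res[u][v] -= push
            res[v][u] += push
            v = u
        total += push


def feasible(t, N, R, theta):
    """True if every supply can be shipped and every requirement met
    using only arcs whose cost is at most theta."""
    m, k = len(N), len(R)
    if sum(R) > sum(N):
        return False
    # A supplier holding units must have at least one admissible arc.
    for j in range(m):
        if N[j] > 0 and not any(t[j][d] <= theta for d in range(k)):
            return False
    src, snk = 0, m + k + 1
    cap = [[0] * (m + k + 2) for _ in range(m + k + 2)]
    for j in range(m):
        cap[src][1 + j] = N[j]
        for d in range(k):
            if t[j][d] <= theta:
                cap[1 + j][1 + m + d] = N[j]
    for d in range(k):
        cap[1 + m + d][snk] = R[d]
    # If the requirements can be saturated, any leftover supply can follow
    # an admissible arc, because destinations have no upper limit.
    return max_flow(cap, src, snk) == sum(R)


def bottleneck_value(t, N, R):
    """Smallest threshold for which the instance is feasible, or None."""
    costs = sorted({c for row in t for c in row})
    if not feasible(t, N, R, costs[-1]):
        return None
    lo, hi = 0, len(costs) - 1
    while lo < hi:                           # feasibility is monotone in theta
        mid = (lo + hi) // 2
        if feasible(t, N, R, costs[mid]):
            hi = mid
        else:
            lo = mid + 1
    return costs[lo]
```

For instance, with two suppliers holding 5 and 4 units, three destinations each requiring 3 units, and costs `[[1, 4, 7], [6, 3, 2]]`, `bottleneck_value` returns 4: the second supplier cannot cover both of its cheap destinations alone, so the first supplier must use its cost-4 arc.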

To summarize our approach to solve the load balanc- 
ing problem, we use an approximate model to obtain an 
approximate solution for the exact model of the load bal- 
ancing problem with resource migration. This approach 
avoids using heuristic rules to obtain the approximate so- 
lutions directly from the exact model. The heuristic rules 
are difficult to derive in many instances and do not take 
into account the specific nature of the applications. In the 
approximate model we propose, how the system is par- 
titioned into regions affects the accuracy of the solution 
and this partitioning can exploit the characteristics of the 
specific application domain. For example, in the case of 
local area networks gatewayed together, each local area 
network with its resources can be considered a region in 
the approximate model. 


In the next section, we discuss in detail the application 
of this load balancing model to the file migration problem 
in distributed systems. 


Database File Migration 


In a distributed relational database, relations are parti- 
tioned either vertically (across attributes) or horizontally 
(across instances of the relations) into possibly overlapping 
fragments which are referred to as database files. These 
files are distributed (also replicated) across the nodes in 
the network. A transaction submitted by the user at a 
node can be either a query or an update on the database. 
The update transaction translates into requests to all the 
relevant sites or nodes for updating the appropriate files. 
We do not address the query decomposition problem here. 
If the file required by a subquery or a query does not exist 
locally (i.e. on the node at which the query is generated) 
then the subquery is sent to a remote node containing the 
file and processed there. We will explain later how a re- 
mote node is chosen for processing a subquery. 

All the database transactions thus generate over a pe- 
riod of time an access pattern for the various database files; 
these access requests are classified into query and update 
requests. When the locality of file access patterns changes 
with time, files (along with the programs to process the 
subqueries) need to migrate among the nodes. For simplic- 
ity, we only consider the problem of single file migration 
(that is, the problem of migrating the multiple copies of a 
file).

The following costs are incurred during file migration:
(1) costs of file and program storage, (2) costs of updating
all file copies, (3) costs of file migration and (4) query
costs. In determining optimal file migration, we want to
minimize the first three costs while maximizing the average
query throughput or minimizing the average query
response time. The query response time consists of the
following two components: (1) query communication time,
which includes the time for sending the request and receiving
the results back, and (2) query processing time, which
includes the time for processing the file access request. In
formulating the file migration problem, we make the
following assumptions:


1. A constant number of file copies is maintained at all 
times. 


2. Due to the first assumption, whenever a new file copy 
is generated at a node, some other existing copy at 
another node needs to be deleted. 


3. All file copies can migrate in parallel. 


4. Query communication delay is independent of the 
query traffic and is dependent only on the commu- 
nication distance. We assume that the nodes have a 
limited processing capacity (expressed in file access 
requests per unit time) and the processing capacity 
is the same for all nodes. Hence the query process- 
ing delay is directly proportional to the query traffic 
directed to the node containing the file copy. 


These assumptions, justifiable in many instances, are made 
primarily to simplify the presentation of our model, and to 
illustrate that even under these simplifying assumptions, 
the problem is already NP-hard. We use the following no- 
tations: 


$V$: set of all nodes in the system, and $P(V)$ its power set
$I, I'$: sets of nodes containing file copies after and before
migration
$f: (I - I') \to I'$: migration function specifying how the
files migrate
$g: V \to I$: query assignment function specifying where
a query request from a node needs to be processed
$h: I \to P(V)$: indicates, for a file copy node, the set
of nodes whose queries need to be processed by that node
(i.e. the inverse function of $g$)
$n$: number of nodes in the system
$q$: query processing capacity of a node (file access
requests per unit time)
$u_x, \theta_x$: update and query request rates from node $x$
$m_{x,y}, s_{x,y}$: unit update and query communication costs
from node $x$ to node $y$
$F_y$: cost per unit time of storing a file copy at node $y$
$E_{x,y}$: cost of migrating a file copy from node $x$ to node $y$
$\delta: I \to \mathbb{N}$: query delay function indicating the average
delay in processing a query at a node
$T$: desired maximum average query response time


There are three cost components, namely the file copy
overhead cost (denoted by $U(I)$), the query cost (denoted
by $Q(I, g)$) and the migration cost (denoted by $R(I, f)$).
These costs are defined as follows:

\[
\begin{aligned}
R(I, f) &= \max_{y \in I - I'} E_{f(y),y} \\
U(I) &= \sum_{y \in I} \Bigl[ \sum_{x \in V} u_x m_{x,y} + F_y \Bigr] \\
Q(I, g) &= \max_{x \in V} \bigl[ s_{x,g(x)} + \delta(g(x)) \bigr]
\end{aligned}
\]

Here $\delta(y) = \sum_{x \in h(y)} \theta_x / q$. The general file migration
problem is posed as follows: Find $I$, $g$ and $f$ (injective) such
that $R(I, f)$ is minimized subject to the constraints
$Q(I, g) \le T$, $U(I) \le C$ ($T$ and $C$ are positive constants)
and $|I| = |I'|$.
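
To make the three cost components concrete, the following Python sketch evaluates them for one candidate solution. It assumes the definitions are read as: the migration cost is the largest `E[f(y)][y]` over newly created copy sites `y`, the overhead cost sums update traffic and storage over all copy sites, and the query cost is the worst communication cost plus processing delay over all nodes. The function names, the use of `theta` for the query request rates, and the three-node data in the usage note are hypothetical.

```python
def migration_cost(E, new_copies, old_copies, f):
    """Bottleneck migration cost: the most expensive single copy move.
    f maps each newly created copy site to the old site it comes from."""
    moved = set(new_copies) - set(old_copies)
    return max((E[f[y]][y] for y in moved), default=0)


def overhead_cost(u, m, F, new_copies, nodes):
    """File copy overhead: update traffic to every copy plus its storage cost."""
    return sum(sum(u[x] * m[x][y] for x in nodes) + F[y] for y in new_copies)


def query_cost(s, theta, q, g, nodes):
    """Worst query response time: communication cost to the assigned copy
    plus the processing delay at that copy (its total query load over q)."""
    load = {}
    for x in nodes:
        load[g[x]] = load.get(g[x], 0) + theta[x]
    return max(s[x][g[x]] + load[g[x]] / q for x in nodes)
```

As a usage example, take three nodes on a line with unit spacing, one copy moving from node 0 to node 2, and every query assigned to node 2. With migration costs `E = [[0, 3, 4], [3, 0, 3], [4, 3, 0]]`, `migration_cost(E, {2}, {0}, {2: 0})` evaluates to 4, the cost of the single move.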

A different formulation of the file migration problem is
given in [8]. In this model, the total cost criterion is used
for all the above three costs and the sum of all these three
costs is minimized while fixing $g$ and $f$ as follows: $g(x) =
\min_{y \in I} s_{x,y}$ and $f(x) = \min_{y \in I'} E_{y,x}$. Also in this model,
$Q(I, g) = \sum_{x \in V} \theta_x g(x)$, that is, the query processing time
is assumed to be dependent only on the communication
cost but not on the query traffic. In our formulation (we
call it the “bottleneck file migration problem”), we separate
the query cost into a constraint on query response time.
This formulation is useful to guarantee a maximum query
response time.

We can easily show that the bottleneck file migration
problem is NP-hard even if there is no constraint on the
file copy overhead cost ($U(I)$); for the proof see [7]. We
will illustrate how our approximate model can be used to
solve this problem. In our approximate model, we simplify
the problem by first eliminating the constraint on the file
copy overhead cost. In most applications, when a constant
number of file copies is maintained, this overhead
cost would not differ significantly among different allocations
of these copies to the nodes. Next we separate the
query access problem (i.e., determining $g$) from the file
migration problem (i.e., determining $f$ and $I$). For this,
we partition the set of nodes into regions $W_1, W_2, \ldots, W_m$.
Let $I'_1, I'_2, \ldots, I'_m$ be the corresponding sets of nodes in the
regions containing file copies before file migration; that is,
$I'_j = \{x \in W_j \mid x \in I'\}$, for $1 \le j \le m$. One of the
possible rules for partitioning is discussed below. First we
introduce the following additional notation:


$N_j$ = number of file copies in $W_j$ before migration
= $|\{x \in W_j \mid x \in I'\}|$
$L_j$ = number of nodes in region $W_j$ (= $|W_j|$)
$S_j$ = average query communication cost in region $W_j$
= $\sum_{x,y \in W_j} s_{x,y} / (L_j^2 - L_j)$
$E_{j,k}$ = average cost of file copy migration from $W_j$ to $W_k$
= $\sum_{x \in I'_j} \sum_{y \in (W_k - I'_k)} E_{x,y} / (N_j \times (L_k - N_k))$
$Q_j$ = total query traffic in region $W_j$ (= $\sum_{x \in W_j} \theta_x$)
$z_{j,k}$ = number of copies to be migrated from $W_j$ to $W_k$


One possible rule to use in the partitioning is to require
that the nodes in each region have similar migration cost
characteristics and communication cost characteristics.

More formally, for $1 \le j \le m$ and $x, y \in W_j$ with $x \ne y$,
$|s_{x,y} - S_j| < \epsilon_1$, for $\epsilon_1$ very small. Also, for $1 \le j, k \le m$,
$x \in W_j$ and $y \in W_k$, $|E_{x,y} - E_{j,k}| < \epsilon_2$ for $\epsilon_2$ very small.
The smaller the values of $\epsilon_1$ and $\epsilon_2$, the larger the number
of regions and the greater the time to solve the problem. On the
other hand, the larger these values, the smaller the number of
regions but the further from optimality the solution from this
model will be. Thus a “good” partitioning rule makes
a suitable trade-off between accuracy and time, and this
trade-off depends on the specific distributed system under
consideration.

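Now we formulate the file migration problem as follows:

\[
\begin{aligned}
\text{Minimize} \quad & \max_{\{(j,k) \mid z_{j,k} > 0\}} E_{j,k} \\
\text{s.t.} \quad & S_j + \frac{Q_j}{q \sum_{k=1}^{m} z_{k,j}} \le T, && \text{for all } 1 \le j \le m \\
& \sum_{k=1}^{m} z_{j,k} = N_j, && \text{for all } 1 \le j \le m \\
& z_{j,k} \in \mathbb{N}, && \text{for all } 1 \le j, k \le m.
\end{aligned}
\]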


As in Section 2, if we introduce for each $j = 1, 2, \ldots, m$
the quantities $R_j = Q_j / (q (T - S_j))$, then the problem becomes
a bottleneck transportation problem. $R_j$ represents the
minimum number of file copies needed in region $W_j$ to
satisfy all the query requests within region $W_j$. We require
here that any query request within a region be directed to
a location within the same region.
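
The step from the response-time constraint to a copy-count requirement can be made explicit. Reading the constraint for region $j$ as "average communication cost plus total query traffic divided by (capacity per copy times number of copies) is at most $T$", and assuming the communication cost is strictly below $T$, rearranging gives a lower bound of query traffic divided by capacity times the remaining time budget. The Python sketch below computes that bound for every region and rounds it up, since copies are whole units; the function name and the sample numbers are hypothetical.

```python
import math


def min_copies(Q, S, q, T):
    """Minimum number of file copies per region so that the region's
    average query response time does not exceed T.

    Q[j]: total query traffic in region j
    S[j]: average query communication cost in region j
    q:    query processing capacity of one node (requests per unit time)
    T:    desired maximum average query response time
    """
    required = []
    for Qj, Sj in zip(Q, S):
        if Sj >= T:
            # No number of copies helps if communication alone uses the budget.
            raise ValueError("communication cost leaves no time for processing")
        required.append(math.ceil(Qj / (q * (T - Sj))))
    return required
```

For example, `min_copies([12, 5], [1.0, 2.0], q=2, T=4)` gives `[2, 2]`: the first region needs exactly 12 / (2 x 3) = 2 copies, and the second needs 1.25 copies, which rounds up to 2.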

In our preliminary investigation of the performance of
this model, we considered the five-node example given in
[1] and an eight-node example. We approximated the file
migration problems in these examples by $2 \times n$ bottleneck
transportation models and used the algorithm we
have developed to solve these problems. For each example,
we considered different partitionings of the system into
regions. Since the diameters of the graphs (with respect
to query communication cost) in the examples were small
compared to the response time, only the second partitioning
rule (that is, the one requiring the regions to have
similar communication characteristics) could be tested in
this experiment. For each partition, we had 500 runs and
in each run, we varied the file access (both the query and
the update request) pattern randomly according to a uniform
distribution. For each run, we compared the optimal 
solution of the exact model with the optimal solution of 
the approximate model based on whether the number of 
file copies in each destination region is the same in the 
two solutions. When the nodes within a region have sim- 
ilar communication cost characteristics, about 50% of the 
runs gave solutions that agreed with the optimal solutions. 
For the arbitrary partitions, this figure varied from 2% to 
25%. Thus the partitioning rule we have mentioned before 
has a significant impact on the “goodness” of the solutions. 

Though the file copy overhead cost was not considered
in the approximate model, we also compared these costs
for the exact and the approximate models; for the approximate
model, these costs can only be estimated since the exact
locations of the file copies are unknown. For a “good”
partitioning, such as the one in which nodes within a region
have small costs to communicate among themselves,
the difference in these costs averaged to within 20% over
all the runs. Thus the file copy overhead does not appreciably
change due to elimination of this cost from the 
approximate model. These results are encouraging consid- 
ering that only a simple partitioning rule has been used 
in these examples. We plan to perform a more extensive 
analysis of the performance of the proposed model and of 
different partitioning rules in particular, by applying it to 
larger scale examples. 


Host Migration in Mobile Computer Networks 


A mobile computer network consists of mobile hosts which
communicate with each other using wireless radio channels.
Thus the topology of a mobile computer network
changes from time to time. Mobile computer networks
are becoming a commercial reality due to the rapid
advances that are being made in mobile communication 
technology. Another practical example of a mobile com- 
puter network arises in robotics applications, where a team 
of robots is employed to perform certain tasks in a co- 
operative manner. The robots in this case, in addition to 
having the processing power, also must have the ability to 
communicate with each other within an operational area 
like mining fields and other harsh environments. 

In a mobile computer network, the mobility of the hosts 
can be used to advantage in balancing the workload among 
the hosts. In the load balancing problem in a mobile com- 
puter network with homogeneous hosts, the objective is to 
minimize the host migration cost with a constraint on the 
job response time. There may also be an additional con- 
straint in that the hosts must be able to communicate with 
each other at all times either directly or using many hops. 
Further we may also require that the results of running a 
job must be available at a host which can at most be at a 
distance r from the location of the host to which the job 
is submitted. 

Since this problem is NP-hard (see [6]), we can use
the approximate model proposed in Section 2 to solve this
problem. For this, we partition the network into geographical
regions. One partitioning rule to use is as follows: the
distance between any two nodes in a region is less than
$r$; the smaller the value of $r$, the larger the number of regions,
and vice versa. The value of $r$ also affects the “goodness” of the
solutions obtained through the approximate model. The
network can also be partitioned on the basis of the structure
of the backbone network that is usually defined for
communication among the hosts. In a mobile computer
network, hosts are divided into clusters with a clusterhead
for each cluster. A backbone network connects the
clusterheads in some configuration. The hosts within a cluster
communicate with each other directly, while intercluster
communication takes place using the backbone network.
A natural choice of regions here will be the clusters
themselves. We can then formulate the approximate load
balancing model in the same fashion as defined in Section
2. The bottleneck transportation problem here determines
the number of host units that should migrate from one region
to another such that the bottleneck host migration
cost is minimized while guaranteeing the maximum average
response time.


Conclusions 


We have proposed a commodity distribution model as a
tractable approximate model for load balancing with resource
migration. If the given distributed system is partitioned
on the basis of job and network characteristics,
then the approximate solution is reasonably close to the
optimal solution of the exact model. This has been demonstrated
to a certain extent in the case of file migration
in distributed databases. A special case of this problem
can be solved by an efficient algorithm which we have 
developed. This resource migration model must be sup- 
plemented with appropriate job migration models for load 
balancing within each region. Though we have demon- 
strated that partitioning rules have an impact on the ac- 
curacy of the solutions, a formal analysis supported by 
experimental results is necessary. This is the subject of 
our future investigation.


An Efficient Termination Detection and Abortion Algorithm
for Distributed Processing Systems 


Kazuaki Rokusawa Nobuyuki Ichiyoshi 


Abstract 


This paper describes an algorithm for termination detection 
and abortion in distributed processing systems, where pro- 
cesses may exist not only in processing elements but also 
in transit. The algorithm works correctly whether the com- 
munication channels are first-in-first-out or not, and no ac- 
knowledgement message is required. Assigning weights to 
all processes and maintaining the invariant that the sum of 
the weights is zero are the main features of the algorithm.


1 Introduction 

Termination detection and abortion of all processes in a 
system are major functions in parallel processing. They 
are easy in closely-coupled systems, such as shared memory 
multiprocessors, but difficult in distributed systems, partic- 
ularly when there are processes in transit. 

We have devised an algorithm for termination detection 
and abortion in distributed processing systems, where pro- 
cesses may exist not only in processing elements but also in 
transit. This algorithm is called the weighted throw count- 
ing scheme, which is an application of the weighted refer- 
ence counting scheme [1] [5], a garbage collection scheme for 
parallel processing systems.

The algorithm will be applied to parallel implementation 
of KL1, a parallel logic programming language based on 
GHC [4], on the Multi-PSI [3], a collection of Personal Se- 
quential Inference Machines [6] (PSI’s) interconnected by a 
fast communication network. 

This paper is organized as follows. Section 2 defines the
computation model employed. Section 3 shows the problems
of termination detection and abortion in distributed
systems. A naive solution is presented in section 4. Section 5
describes the algorithm for termination detection
and abortion where the communication channels are
first-in-first-out. The algorithm for systems with
non-first-in-first-out communication is presented in section 6.
Finally, the comparison of the algorithm with the naive one
is given in section 7.


2 Computation Model 


The following process model is assumed: 


• A process pool consists of one controlling process and
a finite number of child processes;

• There are a finite number of process pools in the system;

• Each process pool is assigned a unique process pool
identifier (PID);

• A child process can terminate at any time;

• A child process can generate another child process having
the same PID, and a new process pool having a new
PID as well.


In this paper, “process” means “child process” unless oth- 
erwise indicated. A process pool terminates if all the chil- 
dren terminate. Aborting a process pool is forcing all the 
children to terminate. A process pool described above is 
distributed over the following machine: 


• A finite number of processing elements (PEs) interconnected
by a communication network;

• No global storage; PEs may communicate by passing
messages;

• Asynchronous communication, in which messages are
delivered with arbitrary finite delay.


It is assumed that a PE can detect the termination of 
all processes in it having the same PID and can force them 
to terminate. The controlling process and PEs can com- 
municate in both directions. A PE may send a message to 
the controlling process informing it of the termination of all 
processes, and the controlling process may send a message 
to abort processes. 

Although there exist a finite number of process pools in
the system at a given time, there is no limit on the total
number of process pools, since any process can generate a
new process pool at any time.

Figure 1: Computation Model

Processes may migrate among PEs for load balancing. To 
achieve this, a PE may throw a process in the PE to an- 
other PE and the thrown process is delivered with arbitrary 
finite delay. Therefore, at a given time, processes may be in 
transit in the communication network but not in any PEs. 


3 Problems 


This section describes why termination detection and abor- 
tion of processes distributed over several processors are dif- 
ficult, particularly when there are processes in transit. 


3.1 Termination Detection 


The controlling process must detect the termination of all 
processes having the same PID as the controlling process. 

Each PE can detect the termination of all processes with 
the same PID in the PE locally and can send a message 
indicating termination (terminated message) to the corre- 
sponding controlling process. 

However, even if the controlling process receives terminated
messages from all PEs, it cannot be sure that all processes
have terminated. There may be processes in transit, which 
will be received by a PE after the PE has sent a terminated 
message. 


3.2 Abortion 


The controlling process must force all processes
having the same PID as the controlling process to terminate.

If the controlling process broadcasts a message causing
a process to terminate (abort message), it is possible to
abort all the processes in the PEs, but impossible to abort
the processes in transit. After receiving an abort message
and aborting the processes, a PE may receive a thrown
process.

If a PE memorizes the PID carried by the abort message
and ignores received processes with the same PID as
that memorized, the abortion-by-broadcast scheme described
above may work. However, this scheme has disadvantages.
First, if only a few PEs have the process to be aborted,
most of the abort messages are useless. Second, it is impossible
to reuse a PID, because the controlling process cannot
detect the completion of the abortion; this is a major
disadvantage.


4 The Naive Scheme 


Ichiyoshi et al. [2] describe a termination detection scheme 
using acknowledge messages. It effectively does the follow- 
ing, although different terminology is used. A non-empty 
set of processes in one PE having the same PID forms a 
subpool of processes, which is called a “process subpool”, 
or a “subpool” in short. Processes in a PE are under the 
control of a subpool. On receiving a thrown process, the
PE decides whether there is already a subpool having the
same PID as the thrown process. If there is, the PE adds
the received process to the subpool and sends back an
acknowledge message; otherwise, it creates a new subpool and
memorizes the sender PE of the process in it. Each subpool
has a counter which is incremented on throwing a process,
and is decremented on receiving an acknowledge message
or a terminated message. When all processes in it have
terminated and the value of the counter reaches zero, the subpool
terminates and sends a terminated message to the memorized
PE.

This scheme is simple and termination can be detected
correctly; if the value of the counter reaches zero, there is
neither a process thrown from the corresponding subpool in
transit nor a subpool created by a process thrown from the
corresponding subpool. However, it has a serious disadvantage:
termination of a subpool depends on the termination
of other subpools. Since subpools form a tree structure, a
root cannot terminate unless all its leaves terminate. In 
the worst case, a chain of subpools is created, where each 
subpool terminates sequentially. 


5 The WTC Scheme 


We have devised a new scheme which requires no acknowl- 
edge message and makes it possible to reuse the PID. This 
new scheme is the weighted throw counting (WTC) scheme 
which is an application of the weighted reference counting 
scheme [1] [5], a garbage collection scheme for parallel pro- 
cessing systems. 


5.1 Termination Detection 


We associate a weight with the controlling process, each
process and each subpool. The weight of a process in transit
and that of a subpool are positive integers, while the weight
of the controlling process is a negative integer. The WTC
scheme maintains the invariant that:


The sum of the weights is zero.


This ensures that the weight of the controlling process
reaches zero if and only if all processes have terminated,
that is, there are no processes either in a PE or in transit
(see figure 2).

When a PE throws a process from a subpool, the PE as- 
signs a weight to the thrown process and subtracts the same 
amount from the weight of the subpool. The new weight of 
the subpool and that assigned to the thrown process should 
both be positive, and the sum of the two weights is equal 
to the original weight of the subpool. For example, if a 
subpool originally weighs 1000, the weight of a thrown pro- 
cess and the new weight of the subpool can be set to 50 
and 950. When a PE receives a thrown process, it adds the 
weight assigned to the received process to the weight of the 
subpool having the same PID. If there is no subpool with 
the same PID, a PE creates a new subpool containing the 
received process and sets its initial weight at the weight of 
the received process. 

When the weight of a subpool becomes one, the PE can- 
not throw a process, because non-zero weight must be as- 
signed to the thrown process and non-zero weight must re- 
main also in the subpool after throwing. The operation 
when this situation occurs is described in section 5.3. 

When all processes in a subpool have terminated, the subpool
terminates and sends a terminated message to the corresponding
controlling process. This terminated message gives notification
of the termination of the subpool and carries the 
weight of the terminated subpool. On receiving a termi- 
nated message, the controlling process adds the weight car- 
ried by the terminated message to its (negative) weight. If 
the weight of the controlling process reaches zero, the ter- 
mination of all processes is detected. 
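
The weight bookkeeping described above can be exercised with a small simulation. The following Python sketch is a toy model under our own assumptions, not the Multi-PSI implementation: one process pool, messages delivered in arbitrary order, a thrown process taking half of its subpool's weight, and a refused throw when the subpool weight is one (the case deferred to section 5.3). The class and method names are hypothetical.

```python
import random


class WTCSim:
    """Toy model of one process pool under weighted throw counting."""

    def __init__(self, num_pes, initial_weight=1 << 20, seed=0):
        self.rng = random.Random(seed)
        self.num_pes = num_pes
        self.controller = -initial_weight   # controlling process: negative weight
        self.subpools = {}                  # PE id -> [weight, live process count]
        # Messages in transit: ("proc", dest_pe, weight) or ("term", None, weight).
        self.transit = [("proc", 0, initial_weight)]

    def total_weight(self):
        return (self.controller
                + sum(w for w, _ in self.subpools.values())
                + sum(w for _, _, w in self.transit))

    def deliver(self):
        # Any message may arrive next: no first-in-first-out assumption.
        kind, pe, w = self.transit.pop(self.rng.randrange(len(self.transit)))
        if kind == "term":
            self.controller += w            # terminated message returns weight
        elif pe in self.subpools:
            self.subpools[pe][0] += w       # join the existing subpool
            self.subpools[pe][1] += 1
        else:
            self.subpools[pe] = [w, 1]      # create a new subpool

    def _process_left(self, pe):
        pool = self.subpools[pe]
        pool[1] -= 1
        if pool[1] == 0:                    # empty subpool reports its weight
            self.transit.append(("term", None, pool[0]))
            del self.subpools[pe]

    def throw(self, pe, dest):
        pool = self.subpools[pe]
        if pool[0] < 2:                     # weight one cannot be split
            return
        w = pool[0] // 2                    # any split into two positive parts works
        pool[0] -= w
        self.transit.append(("proc", dest, w))
        self._process_left(pe)

    def step(self, growing):
        actions = ["deliver"] if self.transit else []
        if self.subpools:
            actions += ["finish"] + (["throw", "spawn"] if growing else [])
        if not actions:
            return
        act = self.rng.choice(actions)
        if act == "deliver":
            self.deliver()
            return
        pe = self.rng.choice(sorted(self.subpools))
        if act == "throw":
            self.throw(pe, self.rng.randrange(self.num_pes))
        elif act == "spawn":
            self.subpools[pe][1] += 1       # new child in the same subpool
        else:
            self._process_left(pe)          # a process terminates


def run(sim, steps=200):
    for _ in range(steps):
        sim.step(growing=True)
        assert sim.total_weight() == 0      # the invariant holds throughout
    while sim.subpools or sim.transit:
        assert sim.controller < 0           # termination not yet detected
        sim.step(growing=False)
        assert sim.total_weight() == 0
    return sim.controller
```

Calling `run(WTCSim(num_pes=4, seed=1))` checks the invariant after every step and returns the controlling process's final weight, which is zero exactly when no subpool and no message remains.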


5.2 Abortion 


This section describes an abortion scheme for the computa- 
tion model with first-in-first-out communication; messages 
are delivered in the order sent. A scheme without this as- 
sumption is described in section 6. 

The controlling process should be able to force all pro- 
cesses with the same PID as the controlling process to ter- 
minate, and detect the termination of all processes to reuse 
the PID. Termination is detected using the WTC scheme 
described above. Thus, only delivery of the abort message 
to each PE containing the subpool is required. To achieve 
this, the controlling process needs to detect the creation of 
a subpool and to send an abort message to a PE containing 
a subpool. 

We introduce here a new message, named the ready message,
which gives notification of the creation of a subpool.
On creation of a subpool, a PE sends a ready message to
the corresponding controlling process. On receiving a ready
message, the controlling process memorizes the sender PE,
and forgets it on receiving a terminated message from that PE.

Figure 2: The WTC Scheme


The controlling process performs the following operations 
to achieve the abortion: 


(1) Sending an abort message to each PE memo- 
rized; 


(2) Sending an abort message to the sender PE of 
a ready message received after operation (1). 


Once the controlling process receives a ready message, a 
subpool may exist in the sender PE until a terminated mes- 
sage is received from the same PE. The controlling process 
therefore performs operation (1), which aborts all subpools 
already detected by the controlling process. Operation (2) 
aborts such subpools that were not recognized by the con- 
trolling process when operation (1) was carried out: a sub- 
pool that is created after operation (1), or created before 
operation (1) but whose ready message is still in transit. 

An abort message must be assigned a weight, like a thrown process.
A ready message needs no weight, because once the controlling
process receives a ready message, it will later receive a terminated
message from the sender PE of that ready message (the FIFO
assumption).

On receiving an abort message, a PE performs either of 
the following operations: 


(3a) Forcing the subpool with the specified PID 
to terminate, and sending back a terminated 
message which carries the sum of the weight 
of the terminated subpool and the abort mes- 
sage; 

(3b) If there is no subpool having the specified
PID, sending back a return message which
carries back the weight assigned to the abort
message.


Figure 3 shows the abortion operations described above. When a
subpool terminates before receiving an abort message, an abort
message may reach a PE having no subpool with the same PID as the
abort message. In this case, operation (3b) is performed and the
return message is sent as the response to the abort message. On
receiving a return message, the controlling process adds the weight
of the message to its own weight. If the weight of the controlling
process reaches zero by this operation, the termination of all
processes is guaranteed.

Figure 3: Abortion Operations

During the operations of abortion, the following cyclic 
situation may occur. The controlling process sends an abort 
message to abort a subpool. A process is thrown from the 
subpool before the abort message arrives. The thrown pro- 
cess is delivered to a PE where there is no subpool having 
the same PID as the thrown process. Then a new subpool 
is created and a ready message is sent. On receiving the 
ready message, the controlling process sends again an abort 
message to abort this newly created subpool. 

On receiving one abort message, one subpool is aborted 
and the non-zero weight of the subpool is sent back to the 
controlling process. Since the sum of the weights of subpools 
and processes in transit is finite, all processes can be aborted 
by sending a finite number of abort messages, even if the 
above situation occurs. 
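
Operations (1), (2), (3a), and (3b) can be sketched as follows for the FIFO case. This is an illustrative sketch only: the class and method names, the choice of weight one for every abort message, and the `sent` log are assumptions, not part of the scheme as published.

```python
class Controller:
    def __init__(self, weight):
        self.weight = weight        # negative while processes are alive
        self.memorized = set()      # PEs whose ready message has been received
        self.aborting = False
        self.sent = []              # (pe, weight) of each abort message, for inspection

    def _send_abort(self, pe, w=1):
        self.weight -= w            # the abort message carries weight w (assumed 1)
        self.sent.append((pe, w))

    def abort(self):                # operation (1)
        self.aborting = True
        for pe in sorted(self.memorized):
            self._send_abort(pe)

    def on_ready(self, pe):
        self.memorized.add(pe)
        if self.aborting:           # operation (2)
            self._send_abort(pe)

    def on_terminated(self, pe, w):
        self.memorized.discard(pe)
        self.weight += w
        return self.weight == 0

    def on_return(self, w):
        self.weight += w
        return self.weight == 0


def pe_on_abort(subpool_weight, abort_weight):
    """PE side: (3a) if a subpool with the PID exists, otherwise (3b)."""
    if subpool_weight is not None:
        return ("terminated", subpool_weight + abort_weight)
    return ("return", abort_weight)


c = Controller(weight=-10)          # one subpool of weight 10 exists on PE 4
c.on_ready(4)
c.abort()
kind, w = pe_on_abort(10, c.sent[0][1])
finished = c.on_terminated(4, w)
```

Because every abort message returns its weight either in a terminated message or in a return message, the controlling process's weight reaches zero only when nothing is left outstanding.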

5.3 When the Weight becomes One


As mentioned in section 5.1, when the weight of a subpool becomes
one, the PE cannot throw a process.

In this case, the PE sends a message requesting more weight (a
request message) to the controlling process. Process throwing is
suspended until the weight of the subpool becomes more than one. On
receiving a request message, the controlling process sends back a
message carrying some weight to the sender PE (a supply message) and
subtracts the same amount from its own weight. When a PE receives a
supply message, it adds the weight carried by the supply message to
the weight of the subpool, which enables it to throw any suspended
processes. Since receiving a thrown process also increases the
weight of the subpool, a subpool may terminate before receiving a
supply message, and a supply message may then reach a PE that
contains no subpool. In this case, a return message is sent back to
the controlling process. This is similar to the action taken when a
PE without a subpool receives an abort message.

It is not necessary to assign any weight to the request message,
because a terminated message is delivered to the controlling process
only after this request message (the channel is FIFO). The weight of
the controlling process therefore never reaches zero while a request
message is still in transit.


5.4 How to Assign a Weight 


This section describes a weight assignment strategy that reduces the
number of additional messages (request and supply messages).

In the worst case, in which a weight of one is always assigned, as
many additional messages as thrown processes are required, while no
additional messages are required in the best case. If the weight
carried by a supply message is large enough compared with the weight
assigned to a thrown process, the weight of the subpool will not
easily reach one after receiving a supply message. The weight
assigned to a thrown process must be less than the weight of the
subpool, while the weight carried by a supply message has no such
limitation. Using the following strategy, a subpool almost always
needs to send a request message only once.


• Assign a fixed weight (say 2^10) to a thrown process if
the weight of the subpool is more than twice that;
otherwise assign half of the weight of the subpool.


• A supply message carries a very large weight (say 2^20).


On receiving a supply message, the weight of the subpool
becomes more than 2^20 and it can throw a process at least
2^10 times without receiving any weight.

If a subpool receives a supply message before its weight
becomes one, it need not send a request message. A
subpool which is created by receiving a process assigned a
weight of 2^10 can throw a process at least 10 times before its
weight becomes one. Therefore, if the controlling process
sends back a supply message on receiving a ready message,
a request message is expected to be unnecessary.
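
The arithmetic behind these counts can be checked with a short sketch. It takes the example weights to be 2^10 and 2^20; the function names, and the rounding down when a weight is halved, are assumptions made for illustration.

```python
FIXED = 2 ** 10     # example weight assigned to a thrown process
SUPPLY = 2 ** 20    # example weight carried by a supply message


def weight_to_throw(subpool_weight):
    """Weight for the next thrown process, or None if a request is needed."""
    if subpool_weight <= 1:
        return None                     # weight one: the PE cannot throw
    if subpool_weight > 2 * FIXED:
        return FIXED                    # plenty of weight: use the fixed amount
    return subpool_weight // 2          # otherwise give away half (rounded down)


def throws_until_one(subpool_weight):
    """Count how many processes can be thrown before the weight drops to one."""
    count = 0
    while (w := weight_to_throw(subpool_weight)) is not None:
        subpool_weight -= w
        count += 1
    return count


from_thrown_process = throws_until_one(FIXED)       # subpool created with 2^10
after_supply = throws_until_one(1 + SUPPLY)         # weight one plus a supply message
```

Halving 2^10 ten times leaves a weight of one, which is where the figure of 10 throws comes from; after a supply message the count is at least 2^10.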


6 Non-FIFO Communication 


In the computation model with non-first-in-first-out com- 
munication, the following situations may occur: 


• A terminated message may be delivered before a ready
message or a request message.


• The controlling process may receive several ready
messages (or terminated messages) before receiving a
terminated message (or a ready message).


The former may cause the weight of the controlling pro- 
cess to reach zero, leaving ready messages or request mes- 
sages in transit. On account of the latter, simply memo- 
rizing or deleting the sender PE of a ready message or a 
terminated message will not work. To cope with the situa- 
tions mentioned above, we modify the scheme as follows: 


• Assign a weight to a ready message and a request
message (a request message will be sent when the weight
reaches two).


• The controlling process keeps one counter for each PE,
which is incremented on receiving a ready message from
that PE and decremented on receiving a terminated
message from it.


The former change ensures that the weight of the controlling
process never reaches zero while any messages or processes are
still in transit. With the latter change, if a subpool may exist
in a PE, the value of the corresponding counter is positive. The
controlling process thus performs the following operations to
achieve the abortion:


(1) Sending an abort message to each PE whose 
corresponding count is positive; 


(2) Sending an abort message to the sender PE of 
a ready message received after operation (1) 
if the count corresponding to the sender PE, 
after increment, is positive. 


Since no more than one subpool can exist in one PE at a 
time, it is enough to send one abort message to one PE. 
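
The counter-based bookkeeping can be sketched as below. Weights are omitted to keep the example short, and the class name, method names, and the `aborts` log are assumptions made for illustration.

```python
from collections import defaultdict


class NonFifoController:
    def __init__(self):
        self.count = defaultdict(int)   # per PE: ready messages minus terminated messages
        self.aborting = False
        self.aborts = []                # PEs to which an abort message was sent

    def on_ready(self, pe):
        self.count[pe] += 1
        if self.aborting and self.count[pe] > 0:     # operation (2)
            self.aborts.append(pe)

    def on_terminated(self, pe):
        # May drive the counter negative when it overtakes its ready message.
        self.count[pe] -= 1

    def abort(self):                                 # operation (1)
        self.aborting = True
        self.aborts.extend(pe for pe, c in sorted(self.count.items()) if c > 0)


c = NonFifoController()
c.on_terminated(3)      # terminated message from PE 3 overtakes its ready message
c.on_ready(5)
c.abort()               # only PE 5 has a positive count
c.on_ready(3)           # late ready: count returns to zero, no abort is needed
c.on_ready(7)           # new subpool created during abortion: abort it
```

The late ready message from PE 3 triggers no abort because its counter only returns to zero, which is the case that simple memorizing and deleting of sender PEs cannot handle.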


7 Comparison 


The WTC scheme is superior to the naive scheme using
acknowledgement in two respects.

First, the WTC scheme requires fewer additional mes- 
sages than in the naive scheme. The number of subpools 
created is expected to be small enough compared with the 
number of thrown processes. The WTC scheme requires 
about the same number of request messages and supply mes- 
sages as the number of the creations of subpools, while the 
naive scheme requires almost the same number of acknowl- 
edge messages as the number of thrown processes. 

Second, in the WTC scheme, each subpool can terminate 
independently, while in the naive scheme, termination of a 
subpool depends on terminations of other subpools. 

8 Summary


We have devised an efficient algorithm for termination de- 
tection and abortion. Its major advantages are as follows. 


• Only a few additional messages are required.
• Each subpool can terminate independently.
• Reuse of the process pool identifier is possible.


The techniques described in this paper are applicable to 
many kinds of distributed processing systems. 


Abstract


In this paper we introduce a new synchronization primitive, 
the distributed synchronizer. This primitive, based on the notion of 
partially shared variables, suits the synchronization requirements of 
parallel algorithms executing on large, shared memory multiproces- 
sors. We consider the commonly required forms of synchronization 
in a multiprocessor: barrier, reporting, and mutual exclusion. We 
introduce the synchronization tree through an algorithm to imple- 
ment barrier synchronization. An efficient implementation of the dis- 
tributed synchronizer primitive requires a) the embedding of the syn- 
chronization tree in the processor-memory multistage interconnec- 
tion network, and b) simple hardware enhancements at the switching 
elements of the network. For n processors, this primitive imple- 
ments reporting with zero synchronization overhead and the barrier 
with a log n cycle overhead. We show that the implementation of 
the semaphore operations using the distributed synchronizer is 
bounded fair. Finally, we discuss some implementation issues and a 
few limitations of our synchronization scheme. 


1. Introduction 


It is well known that the synchronization overheads have a 
deleterious effect on the speedup of parallel algorithms. It has been 
observed that for some applications with extensive synchronization 
requirements, the speedup reaches a maximum for a small number of 
processors, and thereafter decreases [Axel86]. In this paper we intro- 
duce a new synchronization primitive, the distributed synchronizer. 
An implementation of this primitive is shown to be efficient and
bounded fair. The primitive is based on the notion of partially
shared variables, and suits the synchronization requirements of
parallel algorithms executing on large, shared memory multiproces- 
sors. Examples of such architectures are the Cedar system [Kuck84], 
the Ultracomputer [GGKM83], and the RP3 [Pfis85]. Typically such 
a multiprocessor consists of n homogeneous and autonomous
processing elements (PEs). An interconnection network connects the
PEs to a set of main memory modules such that each PE can access any
memory module.


2. Synchronization 
2.1. Classification 


Multiprocessors commonly require the following forms of syn- 
chronization: a) barrier, b) reporting, and c) mutual exclusion. 
(Note: Mutual exclusion is necessary in the first two cases also.) Fol- 
lowing Axelrod [Axel86], we define a synchronization barrier to be a 
logical point in the control flow of an algorithm at which all 
processes must arrive before any of them are allowed to proceed 
further. Reporting requires that all processes must arrive at a con- 
trol point before another specified process continues. 


We illustrate the need for these forms of synchronization
through the following example: suppose the maximum of r numbers
is to be computed on an n-PE shared memory multiprocessor. Let
M be a shared variable initialized to -∞. Each PE computes the
(local) maximum l of the numbers that it is assigned, and updates M
to l if l > M. To detect the completion of finding the maximum,
each PE decrements a shared variable S (initialized to n) after it
updates M. The decrementing of S to zero implies that the maximum
has been computed. Note that each PE requires exclusive
access to update M and S. Suppose a specific PE is required to
compute the maximum. It continually checks S until S becomes zero,
and then reads M. The form of synchronization that we have
described is an example of reporting. Alternatively, assume that
every PE must obtain the maximum for its subsequent computations.
Every PE checks S until S becomes zero and then reads M.
This form of synchronization where values are reported (the value of
S is updated by all the PEs) and then communicated to all the PEs
is an example of barrier synchronization. The barrier is said to be
complete when every PE knows that S is zero. The synchronization
overheads (owing to the continual checking of S and the exclusive 
accesses to M and S) cause serious performance degradation in
multiprocessors, especially as the number of PEs increases [Axel86,
Jaya87a, PfNo85]. To reduce these overheads, various schemes have
been proposed [GGKM83, GLee86, Jaya87a, PfNo85, YeTL86]. A
particularly elegant method, called combining, has been proposed for 
the Ultracomputer [GGKM83], and is being considered for imple- 
mentation on the RP3 [PfNo85]. Combining, which detects and 
combines memory requests to the same location, requires expensive 
hardware. Furthermore, simulations show that combining may not 
be required, or effective in many cases [GLee86]. In the Cedar sys- 
tem, synchronization overheads are reduced by providing additional 
hardware at each memory module to perform simple synchronization 
related computations [ZhYe84]. Both combining and the Cedar 
scheme require the locking of shared variables in order to access 
them with mutual exclusion. Consequently, algorithms using these 
schemes generate wasted memory traffic due to busy waiting. Our 
synchronization technique does not require the locking of globally
shared variables.
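
The maximum-finding example of this section can be sketched with threads standing in for PEs. This is only an illustration of conventional lock-based reporting, not of the primitive proposed in this paper. The function name is hypothetical, a lock stands in for exclusive access to M and S, an event replaces the continual checking of S, and `chunks` is assumed to be non-empty with non-empty sublists.

```python
import threading


def parallel_max(chunks):
    """Each thread plays one PE; M and S are the shared variables of the example."""
    state = {"M": float("-inf"), "S": len(chunks)}
    lock = threading.Lock()             # exclusive access to M and S
    all_reported = threading.Event()

    def pe(numbers):
        local = max(numbers)            # local maximum l
        with lock:
            if local > state["M"]:
                state["M"] = local      # update M to l if l > M
            state["S"] -= 1             # report by decrementing S
            if state["S"] == 0:
                all_reported.set()

    threads = [threading.Thread(target=pe, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    all_reported.wait()                 # the reporting PE proceeds once S is zero
    for t in threads:
        t.join()
    return state["M"]


result = parallel_max([[3, 9], [7], [1, 12, 4], [5]])
```

Every thread contends for the same lock, which is the serialization that the overheads discussed above arise from.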


2.2. Walk-in Walk-out Scheme


Rather than force all PEs to access a single variable, we could
allow a fixed number of PEs, m (2 ≤ m ≤ n), to access each variable.
(We assume n to be a power of m throughout the paper.) Such a
scheme increases the number of synchronization variables required
but might reduce memory contention. In particular, we could
arrange the synchronization variables in the form of a
synchronization tree as shown in Figure 2.1. All the variables in the
tree are initialized to m. If a PE decrements a variable and finds the resulting
value to be zero, then it proceeds to the next higher level of the tree. 
Otherwise the PE waits for the variable to assume a special value, 
say, -1. The last arriving PE decrements the root to zero and sets it 
to -1. At this time the walk-in ends. When a PE finds the variable 
on which it is waiting to be -1, it communicates this information to
the next lower level. This procedure is repeated recursively. The 
completion time would be the time at which the last PE at the 
lowest level finds its variable to be -1. At this time the walk-out 
ends. This algorithm is similar to the software combining algorithm 
proposed in [YeTL86]. 


Figure 2.1 Synchronization Tree for the Walk-in Walk-out Scheme.


Notation: In the figures, the PEs and the nodes of the
synchronization tree are numbered from left to right. We number the
PEs from 1 to n. The PE with number i is denoted by PE_i. Upper
and lower case names represent shared and local variables
respectively. Let h = log_m n, the number of stages in the
interconnection network.

3. The New Primitive 


3.1. Partially Shared Variables 


The Walk-in Walk-out scheme has the advantage of requiring
only a fixed number of PEs, m, to access a node of the
synchronization tree. Furthermore, a node at level i need be shared
only by m^i PEs (note: some m out of these m^i PEs access the node).
This observation suggests that the nodes of the synchronization tree
may be placed in a memory hierarchy according to the degree to which
the nodes are shared. By a memory hierarchy we mean a set of
partially shared memories such that a variable at level i is shared
by more PEs than a variable at level j (j < i). Variables could be
placed at an appropriate level in the memory hierarchy, thus
eliminating expensive global memory trips to access shared
variables that need to be only partially shared. The multi-memory
hierarchy effectively distributes the "von Neumann bottleneck" and
consequently achieves better performance. The partial sharing of
variables would lead to increased complexity in the hardware and
memory management.
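
The degree of sharing at each level can be made concrete with a small sketch. It assumes the left-to-right numbering from 1 given in the Notation paragraph, so that consecutive blocks of m^i PEs share one node at level i; the function names are hypothetical.

```python
def node_of(pe, level, m):
    """Index (from 1) of the level-`level` node on the path from PE `pe` to the root."""
    width = m ** level
    return (pe + width - 1) // width    # integer ceiling of pe / m**level


def sharers(node, level, m):
    """PEs that share node `node` at level `level`: a block of m**level PEs."""
    width = m ** level
    return range((node - 1) * width + 1, node * width + 1)


# Path of PE 5 through a binary tree over 8 PEs (levels 1, 2, 3).
path = [node_of(5, level, 2) for level in (1, 2, 3)]
block = list(sharers(2, 2, 2))          # PEs sharing node 2 at level 2
```

For n = 8 and m = 2, a level-1 node is shared by 2 PEs, a level-2 node by 4, and the root by all 8.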


3.2. Distributed Synchronizer 


Consider any multistage interconnection network with the full 
access capability (which means that any input terminal of the net- 
work can reach any output terminal in one pass through the net- 
work) and the unique path property (which means that each input 
terminal has exactly one path through the network to reach any par- 
ticular output terminal). Feng [Feng81] surveys such interconnec- 
tion networks. Examples of these networks include the Omega, the 
baseline, the banyan, and the indirect binary n-cube. They are
typically designed using log_m n stages of m × m switching elements.
We embed the synchronization tree into the interconnection network
with the leaves placed in the switching elements at the first stage
and the root placed in a switching element at the last stage of the
network. The connections between stages that lead to the switching
element containing the root correspond to the branches of the tree.
Figure 3.1 shows an embedding of the binary synchronization tree
(with eight leaves) into an 8-input 2 × 2 Omega network. The heavy
lines in the figure represent branches of the tree. Each stage of the
interconnection network corresponds to an intermediate level in the
memory hierarchy mentioned earlier, and each node in a switching
element to a partially shared variable. Mutual exclusion must be
guaranteed among the m children accessing their parent node. We
explain how this is achieved for each synchronization operation in
the relevant sections. The name distributed synchronizer refers to
the embedded synchronization tree together with the operations
defined on the tree.


4. Reporting and Barrier


A switching element in the network is an m × m bidirectional
router. On its PE side each switching element has m input ports
PI_1, ..., PI_m and m output ports PO_1, ..., PO_m. On its memory
side each switching element has m input ports MI_1, ..., MI_m and m
output ports MO_1, ..., MO_m. An input port PI_i on the PE side gets
connected to an output port MO_j on the main memory side during a
message transfer (message originating at a PE). Similarly, an input
port MI_i gets connected to an output port PO_j during a message
transfer (message originating at the main memory). Additionally,
each switching element has a modulo m counter, a decoder, and
some combinational logic.


For simplicity, we make the following assumptions: 
(A1) All PEs participate in the synchronization operation. 
(A2) At any instant at most one concurrent set of synchronization 
operations is being executed. 

Fig. 3.1 Embedding a Synchronization Tree into an 8×8 Omega Network.


4.1. Reporting


To perform the reporting operation, each PE executes the
Rep(S) instruction. S is a flag variable in global memory which is
initially reset. S is set after all PEs have reported. The semantics of
Rep(S) are shown in Figure 4.1, and are informally described below.
A PE executing Rep(S) enters the synchronization tree at its
appropriate leaf node. It decrements the counter at the switching
element. If it finds the value of the counter to be zero, then it is
the last arriving PE for this synchronization operation at the node.
It reinitializes the counter to m and proceeds to the node in the
next higher level of the tree. Otherwise the instruction completes.
This procedure is repeated recursively. The PE that decrements the
root to zero then sets S, at which time the synchronization operation
completes. This scheme has the following advantages: a) The shared
variable S need not be locked. To ensure mutual exclusion,
traditional multiprocessors use locking, which incurs an unnecessary
global memory access whenever a PE tries to lock a variable that has
already been locked by some other PE. The decrementing of the
counter at a node has to be performed atomically, but this is easily
achieved in the hardware. b) There are no synchronization overheads
except for the constant time to execute the Rep(S) instruction.

Note that the reinitialization of the synchronization tree is
achieved in a distributed manner since the last arriving PE at each
node sets the counter to its initial value.


Algorithm Rep (S, PE#, level#);
/* The counter variables C_{i,j} are initialized to m, and S is reset */
begin
1.  j := ⌈PE# / m^level#⌉;
2.  C_{level#,j} := C_{level#,j} - 1;
3.  if C_{level#,j} = 0 then  /* last arriving PE at some node */
4.    if level# = h then
        begin
5.        C_{level#,j} := m;  /* reinitialize */
6.        set S;  /* reporting is done */
        end
      else  /* last arriving PE at a non-root node */
        begin
7.        C_{level#,j} := m;  /* reinitialize */
8.        Rep(S, PE#, (level# + 1));  /* to next higher level */
        end
end Rep;


PE_i executes Rep(S, i, 1).
Figure 4.1 Semantics of the Rep(S) Instruction.


The following cases: a) requirement of d (d > 1) concurrent
synchronization operations, and b) requirement of reporting among
only n_1 (n_1 < n) PEs, are discussed in [Jaya87b].
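
The behaviour of Rep(S) can be simulated sequentially in software. The sketch below is not the hardware implementation: it assumes that n is a power of m, that nodes are numbered from 1 left to right so that the node for a PE at a given level is the ceiling of PE#/m^level#, and that PEs arrive one at a time. The names make_tree and rep are hypothetical.

```python
def make_tree(n, m):
    """Counters C[level][node], all initialized to m; returns (C, h)."""
    h, size = 0, 1
    while size < n:
        size *= m
        h += 1
    C = {lvl: {j: m for j in range(1, n // m ** lvl + 1)} for lvl in range(1, h + 1)}
    return C, h


def rep(C, h, m, pe, level=1):
    """Return True if this call sets S, i.e. `pe` was the last PE to report."""
    width = m ** level
    j = (pe + width - 1) // width       # node on the path from `pe` at this level
    C[level][j] -= 1
    if C[level][j] != 0:
        return False                    # not the last arrival: instruction completes
    C[level][j] = m                     # the last arrival reinitializes the counter
    if level == h:
        return True                     # root decremented to zero: S is set
    return rep(C, h, m, pe, level + 1)  # proceed to the next higher level


C, h = make_tree(8, 2)
results = [rep(C, h, 2, pe) for pe in [3, 1, 8, 2, 5, 4, 7, 6]]
```

Only the last arriving PE sets S, and every counter is back at m afterwards, so the tree is ready for the next reporting operation without a separate reset.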


4.2. Barrier 


We will assume that (A1) and (A2) hold. Each PE performs
the barrier synchronization by executing the instruction Barrier(v);
v refers to a flag, initially reset, that is present in each PE. The
semantics of the instruction, shown in Figure 4.2, are essentially an
implementation of the Walk-in Walk-out scheme in hardware. The
hardware requirements are the same as for reporting, except for the
extra bit flag in each PE. If a request finds the counter at a stage
to be zero, it proceeds to the next stage. Otherwise the requesting
PE waits for its flag v to be set. The request that decrements the
counter at the root to zero is the last process to arrive at the
barrier. It signals the completion of the barrier to the (m - 1)
requests waiting at the root by placing a return request, containing
the address of the instruction, at each of the m PO ports of the
root's switching element. The procedure of "walking out" is
recursively repeated until the leaves are reached. The final step
consists of setting the v flag of each PE, at which time every PE
comes to know that the barrier has been completed.


Algorithm Barrier (v, PE#, level#);
/* The counter variables C_{i,j} are initialized to m, and v is reset */
begin
1.  j := ⌈PE# / m^level#⌉;
2.  C_{level#,j} := C_{level#,j} - 1;
3.  if C_{level#,j} = 0 then  /* last arriving PE at some node */
4.    if level# = h then  /* last arriving PE at barrier */
5.      Walkout(v, level#, j)  /* begin walk-out */
      else  /* last arriving PE at a node which is not the root */
6.      Barrier(v, PE#, (level# + 1));  /* to next higher level */
end Barrier;


procedure Walkout(v, level#, k);
begin
1.  if (level# = 1) then
2.    set the flag in each PE
    else
3.    for l := (k-1)×m + 1 to k×m do
      begin
4.      place return request in each PO_l port;
5.      Walkout(v, (level# - 1), l);
      end;
end Walkout;


PE_i executes Barrier(v, i, 1).


Figure 4.2 Semantics of the Barrier(v) Instruction.

The delay between the time at which the last process finishes
(i.e., the time at which the root is decremented to zero), and the time
at which every PE knows that the barrier is complete is just log_m n
cycles (assuming no network conflicts). The synchronization is
accomplished with no wasted memory accesses, and without the need
for any globally shared variables.
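
The log_m n delay of the walk-out can be checked with a short sketch. It assumes that return requests descend one tree level per cycle with no network conflicts, and that the children of node k are nodes (k-1)m+1 through km, as in the Walkout procedure; the function name is hypothetical.

```python
def walkout_cycles(n, m):
    """Cycles for the walk-out to reach all n PEs, descending one level per cycle."""
    frontier = [1]                      # the root is node 1 of the top level
    cycles = 0
    while len(frontier) < n:
        # Every node on the frontier forwards a return request to its m children.
        frontier = [(k - 1) * m + c for k in frontier for c in range(1, m + 1)]
        cycles += 1
    return cycles, frontier


cycles, reached = walkout_cycles(8, 2)
```

For n = 8 and m = 2 the walk-out takes log_2 8 = 3 cycles and reaches PEs 1 through 8.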


Each PE could "busy wait" on the flag v till it is set, or could
context switch to a different process after executing the Barrier(v)
instruction. In the latter case, the PE may be interrupted to signal
the completion of the barrier. Observe that, unlike normal busy
waiting, PEs in the former case do not generate wasteful memory
traffic, which can cause a degradation in performance.


Assumptions (A1) and (A2) may be relaxed by providing extra
hardware. See [Jaya87b] for details.




5. Semaphore Operations 


A semaphore is a shared variable S together with the atomic
operations P(S) and V(S) defined as follows:
P(S): <while S ≤ 0 do skip; S := S - 1>;
V(S): <S := S + 1>;
(Instructions within the angle brackets are executed atomically.)
The variable S is initialized to one.
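
A conventional busy-waiting implementation of these definitions can be sketched as follows. The sketch reads the loop condition as "wait while S is not positive", uses a lock to emulate the atomicity of the angle brackets, and releases that lock between successive checks of S so that a V operation can get through. The class name and the driver function are hypothetical.

```python
import threading
import time


class BusyWaitSemaphore:
    def __init__(self, value=1):
        self.S = value
        self._atomic = threading.Lock()     # emulates the angle brackets

    def P(self):
        while True:
            with self._atomic:
                if self.S > 0:              # test and decrement in one atomic step
                    self.S -= 1
                    return
            time.sleep(0)                   # yield, then check S again (busy waiting)

    def V(self):
        with self._atomic:
            self.S += 1


def run(workers=4, rounds=500):
    sem, total = BusyWaitSemaphore(), [0]

    def work():
        for _ in range(rounds):
            sem.P()
            total[0] += 1                   # critical region
            sem.V()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], sem.S


outcome = run()
```

Each unsuccessful pass through the loop in P is a wasted access to S, which is the overhead the following sections aim to remove.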


Most implementations of P and V require the continual check- 
ing of S, as implied by the while loop, before the P operation is suc- 
cessful. An alternative to this busy waiting on S on an unsuccessful 
P operation is to context switch to a different process. Context 
switching, however, requires intervention by the operating system, 
and consequently large overheads. Further, self scheduling and 
guided self scheduling algorithms [PoKu87, TaYe86], which are used
in a number of application programs in a multiprocessor system, 
require some form of busy waiting. 


We next describe P and V implemented with the new primi- 
tive. For simplicity, we deal with binary (m=2) synchronization 
trees. We will also assume that assumption (A2) holds. Not all PEs 
need participate in the synchronization, however. Each node of the 
tree is a bit variable which is initially zero (reset). 


5.1. The DSP Instruction 


To implement the P operation, each PE executes the 
Distributed Synchronizer P (DSP) instruction, the semantics of 
which are shown in Figure 5.1. The P operation for a process 
belonging to a PE is complete if it can set the node at the root of the 
synchronization tree. If a process finds that the root is already set, 
then another process is in the critical region. 


In the following discussion, when we talk of a node at level i, it
refers to the node at level i that lies between the PE of interest and
the root. Note that by the unique path property, there is only one
path from a PE to the root, and hence there is a single node on the
path at any level. A request contains the address of the instruction
and the address of the PE where it originates. Consider a request
which has reached level i successfully by setting the node at level i.
If the node at level i+1 has not been already set (by an earlier
process), then the node at level i+1 is set and the node at level i is
reset atomically. Otherwise, the process waits at level i. The atomic
operation is accomplished easily, without starvation, in hardware
[Jaya87b]. Nodes with their bit variables set represent processes
waiting to enter their critical regions. We next describe a schedule
to wake up waiting processes.
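
The upward movement of requests under DSP can be simulated in software. The sketch below models only the walk toward the root; waiting is represented by returning the level at which the request stops, and no wake-up is modelled. The class name, the zero-based node indexing within a level, and the sequential arrival of requests are assumptions made for illustration.

```python
class DspTree:
    """Binary synchronization tree for n = 2**h PEs; every node is a bit, initially 0."""

    def __init__(self, h):
        self.h = h
        # FL[k][i]: bit of node i (counted from 0) at level k, for k = 1..h
        self.FL = {k: [0] * (2 ** (h - k)) for k in range(1, h + 1)}

    def dsp(self, pe):
        """Advance a request from PE `pe` (numbered from 1); return the level reached.

        A return value of h means the request set the root and entered the
        critical region; a smaller value is the level at which it waits.
        """
        k = 0
        while k < self.h:
            parent = (pe - 1) >> (k + 1)
            if self.FL[k + 1][parent]:
                break                               # node above is set: wait at level k
            self.FL[k + 1][parent] = 1              # set the node at level k+1 ...
            if k != 0:
                self.FL[k][(pe - 1) >> k] = 0       # ... and reset the node at level k
            k += 1
        return k


tree = DspTree(3)                                   # 8 PEs
levels = [tree.dsp(pe) for pe in (1, 2, 3, 5)]
```

With requests arriving from PEs 1, 2, 3, and 5 in that order, the first enters its critical region and the others wait at levels 2, 1, and 2 respectively.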


5.2. The DSV Instruction 


To describe the Distributed Synchronizer V (DSV) operation,
we use another property that the class of interconnection networks
described in Section 3.2 share: they can employ simple and
distributed routing algorithms. Recall that h = log_2 n, and let
a_h a_{h-1} ... a_1 a_0 be the binary expansion of an integer j.
Further, let PE_j have a request that is waiting at the node at
level i to enter the critical region. The request stores the values
a_i and a_{i+1} as part of its information at the node. The semantics
of DSV are shown in Figure 5.2, and informally explained below. Let a
process belonging to PE_k finish, and let k_h k_{h-1} ... k_1 k_0 be
the binary expansion of k. By virtue of the semantics of DSP, if
there are processes waiting at level i, then there are processes
waiting at level (i+1) also. Hence PE_k has to check only the
children at the root of the synchronization tree. If PE_k is at a
leaf of the left (right) subtree, then the switching element logic at
the root enables a waiting process, if any, at the right (left)
subtree (i.e., a PE whose address in the (h+1)th position is k̄_h,
the complement of k_h). If there is no such process waiting, then a
process, if any, belonging to its own subtree is enabled. The
awakened process repeats this procedure recursively using the
appropriate position of the binary expansion of its PE number.
Thus the process belonging to PE_j, when awakened at a node at
level i, decides which new process should occupy the vacant node by
examining j_i and the ith positions of the PE numbers of the
processes at the children of the node.


Algorithm DSP (j, k);
/* Let FL_{j,1}, FL_{j,2}, ..., FL_{j,h} be the partially shared
   nodes on the path from PE_j to the root. FL_{j,0} is a bit variable
   in PE_j */
begin
1.  if k = h then enter critical region;
    else
2.    if FL_{j,k+1} = 0 then
      begin
3.      <FL_{j,k+1} := 1; if k ≠ 0 then FL_{j,k} := 0;>  /* atomically done in the hardware */
4.      DSP (j, k+1);
      end;
end DSP;

PE_j executes DSP(j, 0) to enter its critical region.
Figure 5.1 Semantics of the DSP Instruction.


Algorithm DSV (j, k);
/* Let FL_{j,1}, FL_{j,2}, ..., FL_{j,h} be the partially shared
   nodes on the path from PE_j to the root. FL_{j,0} is a bit variable
   in PE_j. Let a_h ... a_k ... a_0 be the binary expansion of j;
   ā_k denotes the complement of a_k */
begin
1.  if FL_{a_h ... ā_k ... a_0, k-1} = 1 then
      begin
2.      FL_{a_h ... a_k ... a_0, k} := 1;
3.      transfer process from FL_{a_h ... ā_k ... a_0, k-1} to next higher level;
4.      FL_{a_h ... ā_k ... a_0, k-1} := 0;
5.      DSV(newPE, k-1);  /* newPE is the address of the PE on
                             which the new process runs */
      end
    else
6.    if FL_{a_h ... a_k ... a_0, k-1} = 1 then
        begin
7.        FL_{a_h ... a_k ... a_0, k} := 1;
8.        transfer process from FL_{a_h ... a_k ... a_0, k-1} to next higher level;
9.        FL_{a_h ... a_k ... a_0, k-1} := 0;
10.       DSV(newPE, k-1);  /* newPE is the address of the PE on
                               which the new process runs */
        end
end DSV;


PE_j executes DSV(j, h) after leaving its critical region.
Figure 5.2 Semantics of the DSV Instruction.


Example: Consider an 8-PE system with the synchronization
tree as shown in Figure 5.3a. All the nodes in the tree are initially
reset. The timing diagram of Figure 5.3b shows a particular set of
requests to the critical section and the order in which they are
honored. In that figure, r stands for a request to enter the critical
region, h for a request honored by the scheduler, and c for the
completion of the use of the critical region. The prefix digit
denotes the PE number. Observe that, for this example, PEs 5 and 7
get serviced out of turn. The order in which the requests are honored
is readily seen by following the informal explanations given for DSP
and DSV.


The hardware requirements at a switching element are a 3-bit
register and simple logic (to perform the logical OR and the reset
operations atomically). If the first bit of the 3-bit register at
level i is set, then some process belonging to PE_k is waiting at this
node. The other two bits then store the ith and the (i+1)st position
of the binary expansion of k.


The following cases: a) relaxing assumption (A2), and b) the
case of m = 2^l (l > 1), are considered in [Jaya87b].

Figure 5.3a Illustration of DSP and DSV.

Figure 5.3b Illustration of DSP and DSV (contd).


5.3. Bounded Fairness 


Many notions of fairness for concurrent systems have been pro- 
posed [Fran86]. In the context of (mutually exclusive) access to a 
shared resource (such as a critical region), fairness is synonymous 
with "starvation freedom". The P and V operations are used to 
ensure mutual exclusion. Many implementations of P and V require
busy waiting on the semaphore variable, which, in turn, may lead to
starvation. We show that DSP and DSV not only are starvation
free, but satisfy the stronger notion of bounded fairness [JaDe86].
For our purposes, we define bounded fairness within the context of a
scheduler (which may be centralized or distributed). A scheduler q is
k-bounded fair if a process p wishing to enter its critical region is
guaranteed to do so at one of the next k times that q schedules either
p or a process arriving after p (k is referred to as the bounded
fairness number). For example, a FIFO scheduler is 1-bounded fair.
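The definition above can be made concrete with a small sketch. The function below is an illustration only, not part of the paper: it takes one recorded trace (the hypothetical lists `arrival_order` and `service_order`) and reports the smallest k that this single trace is consistent with. It says nothing about the guarantee a scheduler offers over all traces.

```python
def bounded_fairness_number(arrival_order, service_order):
    """Smallest k consistent with k-bounded fairness for ONE observed trace.

    arrival_order: process ids in the order their requests arrived.
    service_order: the same ids in the order the scheduler honored them.
    For each process p we count how many later-arriving processes were
    scheduled before p, plus one for the scheduling of p itself.
    """
    arrival_rank = {p: t for t, p in enumerate(arrival_order)}
    worst = 0
    for pos, p in enumerate(service_order):
        overtakers = sum(
            1 for q in service_order[:pos] if arrival_rank[q] > arrival_rank[p]
        )
        worst = max(worst, overtakers + 1)
    return worst


# FIFO service never lets a later arrival overtake an earlier one.
print(bounded_fairness_number(["a", "b", "c", "d"], ["a", "b", "c", "d"]))
# Here "b" is overtaken by "c" and "d" before it is served.
print(bounded_fairness_number(["a", "b", "c", "d"], ["a", "c", "d", "b"]))
```

A FIFO trace yields 1, in agreement with the remark that a FIFO scheduler is 1-bounded fair.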


We first make the following observations before we prove the 
bounded fairness of DSP and DSV. 


Observation 5.1: The semantics of DSP ensure that a process 
wishing to enter its critical region either does so, or traverses to a 
level of the synchronization tree until it can no longer proceed. 


Observation 5.2: From the above observation, we can infer that it 
is sufficient for a process leaving the critical region to activate 
another process to enter its critical region by examining the nodes at 
the children of the root of the synchronization tree. 


Observation 5.3: Further, by virtue of the semantics of DSV and 
the above observations, a process that "vacates" an internal node (to 
travel to the next higher level of the synchronization tree) can also 
decide which new process should occupy the vacant node by examin- 
ing the children of the node. 


Theorem 5.1: For an n-PE multiprocessor, DSV is
$(2n - (1 + \log_2 n))$-bounded fair.

Proof: Let us append the n PEs as leaves of the synchronization 
tree and call the result the extended synchronization tree. Consider 
such a tree shown in Figure 5.4. Suppose a process belonging to PE_i
sends a request r to enter the critical region. In the case of a conflict 
among the processes in the m (m=2) subtrees at a node, the 
scheduling policy (i.e., the semantics of DSV) chooses the subtree 
which has not been most recently chosen. From observation 5.3, this 
selection can indeed be done by examining the children at the root of 
the subtree of interest. Hence, in the worst case, the request r 
accesses the root (i.e., enters its critical region) in at most as many 
tries (each entry into a critical region by any process is a try) as the 
number of nodes in the subtrees $T_L$ and $T_R$, i.e., $2(n-1)$. The
following cases arise:


Case 1: If r accesses the root in at most $2n - (1 + \log_2 n)$ tries, then
the condition for bounded fairness is trivially satisfied.


Figure 5.4 Extended Synchronization Tree for Proving Bounded Fairness.


Case 2: If r accesses the root in exactly $2(n-1)$ tries, then at every
node on the path from PE_i to the root, a PE from the subtree not
containing PE_i was chosen. By induction it is clearly seen that at
every node on the path from PE_i to the root, PE_i should have a
request sent before r. Hence the number of requests in the tree
before r is at least $\log_2 \frac{n}{2}$ (i.e., the height of the
subtree whose leaves are one level higher than the leaves of $T_R$).
Since the request r can be satisfied in $2(n-1)$ tries, and there were
at least $\log_2 \frac{n}{2}$ requests before r, the bounded fairness
number k is $2(n-1) - \log_2 \frac{n}{2} = 2n - (1 + \log_2 n)$.
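The arithmetic of Case 2 can be spelled out as follows. This is a worked check, assuming the count of earlier requests is $\log_2 \frac{n}{2}$, which is the reading under which the expression agrees with the bound stated in Theorem 5.1. The value n = 8 is chosen only because it matches the 8-PE example used earlier.

```latex
\begin{align*}
k &= 2(n-1) - \log_2\frac{n}{2}
   = 2n - 2 - (\log_2 n - 1)
   = 2n - (1 + \log_2 n), \\
n = 8:\quad
k &= 2\cdot 7 - \log_2 4 = 14 - 2 = 12 = 16 - (1 + 3).
\end{align*}
```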


Case 3 (Sketch): If r accesses the root in t tries, where
$2n - (1 + \log_2 n) < t < 2(n-1)$, then by an argument similar to that
used in Case 2, it can be shown that there are at least
$t - (2n - (1 + \log_2 n))$ nodes on the path between PE_i and the root
that contain requests sent from PE_i before r. The bounded fairness
result immediately follows. □


Note: The bounded fairness result may be extended to m-ary
synchronization trees. In [Jaya87b] it is shown in such a case that DSV
is $\left(\frac{nm - m}{m - 1} - \log_m \frac{n}{m}\right)$-bounded fair.


6. Further Remarks 


Our implementation of the synchronization operations has a 
few limitations. They are the following: 
(1) Number of Synchronization Operations: The present imple- 
mentation of the distributed synchronizer allows only a limited 
number of concurrent synchronization operations. 
(2) Restricted Schedules: In the case that barrier and reporting
are required only among $n_1$ ($n_1 < n$) PEs, the root of the
synchronization tree may be placed at an intermediate stage of the
interconnection network. In such a case, not every subset of
PEs could use the distributed synchronizer to perform these
operations. This observation translates to a restriction on
processor scheduling.
(3) Process Migration: In our barrier and reporting schemes, 
processes may not migrate across PEs. This is not a serious 
limitation (except when fault tolerance is to be provided), 
since, for efficiency reasons, most multiprocessor operating sys- 
tems do not allow processes to migrate. 
These limitations may be partially overcome. See [Jaya87b] 
for details. 


7. Conclusion 


Synchronization and communication overheads become impor- 
tant performance criteria in large multiprocessors. A number of 
researchers have recognized that synchronization requirements lead 
to serious performance degradation in such systems. In this paper 
we have introduced a new synchronization primitive based on the
notion of partially shared variables. Using this primitive, we have
shown that the commonly required synchronization operations may
be efficiently performed with practically no overheads. Though this
implementation of the distributed synchronizer does not wholly
remove the need for conventional synchronization operations based
on the locking of shared variables, our paper shows a promising way
to distribute synchronization operations and to exploit the power of 
partially shared variables. An interesting extension of this work is 
to make the distributed synchronizer fault tolerant. An important 
research area is the feasibility of architectures based on the notion of 
hierarchical memories. 
Graph-based partitioning of matrix algorithms
for systolic arrays: application to transitive closure 
Abstract. We propose a technique to partition algorithms for
execution in systolic arrays, based on transformations to the
dependency graph of algorithms. We illustrate this method
through its application to the computation of the transitive
closure of a directed graph. We derive linear and two-dimensional
structures for this algorithm that exhibit maximal utilization,
no overhead due to partitioning, and simple control. In the
process, we obtain a graph suitable for an array for fixed-size
problems that exhibits better characteristics than arrays
previously proposed for this algorithm. Our method also allows
evaluating trade-offs among implementations.


Introduction 


The implementation of matrix algorithms as collections of reg- 
ularly connected processing elements (arrays of PEs) has been 
extensively studied lately. Many applications require process- 
ing large matrices for which it is not feasible to build an ar- 
ray of the required size, while others require solving problems 
of variable size using the same array. In such cases, it
becomes necessary to decompose the problem into sub-problems
so that the sub-problems fit into a target array. This is known
as partitioning the algorithm and has been studied by many
researchers [1]-[5].


In this paper, we summarize a partitioning technique based
on the dependency graph of algorithms. A complete description
of the technique can be found in [6]. This is a transformational
approach that uses a fully-parallel dependency graph as the
description of the algorithm. Such a graph is first transformed
to remove properties not desirable for an implementation (i.e.,
data broadcasting, bi-directional data flow) and converted into
a graph suitable for partitioning (i.e., with simple communication
requirements). The resulting graph is mapped onto the
target array. The transformations are performed taking into
account issues such as I/O bandwidth, throughput, delay, and
utilization of PEs. We illustrate the technique through its
application to the design of arrays for partitioned computation of
the transitive closure of a directed graph. We derive and evaluate
linear and two-dimensional structures to compute this
algorithm. These arrays exhibit maximal utilization, no overhead,
and simple control. In addition, we show that an intermediate
graph used by the methodology is suitable for implementation
of fixed-size arrays for transitive closure, with better
characteristics than arrays previously proposed for such computation.


We have applied our partitioning technique to several
algorithms for matrix computations, among them LU-decomposition,
QR-decomposition, and the Faddeev algorithm [7].
Our results show that the graphical nature of our approach
makes it easier to use than methodologies based on
mathematical expressions proposed in the literature. Moreover,
the method allows evaluating trade-offs between linear and
two-dimensional arrays for partitioned execution of algorithms.
This technique is an extension of one for the design of arrays
for fixed-size problems that we have previously proposed [8,9].


Partitioning the computation of the transitive closure of a
directed graph has been recently addressed by Núñez and
Torralba [10]. They propose an algorithm and partition it through
decomposition into a block algorithm. Although they do
not address the details of an implementation, their algorithm
requires rather complex control to chain the different
sub-problems.


Graph-based partitioning 


Partitioning consists of mapping the computation of an algo- 
rithm with large-size data onto an array smaller than the size 
of the data. Three basic approaches have been proposed to 
achieve such mapping: 


• coalescing [1,5]
• cut-and-pile [1]
• decomposition into subalgorithms [1]


The relative merits of these approaches are discussed in [6]. 


We summarize here a partitioning technique based on the
dependency graph of algorithms that uses the cut-and-pile
approach due to its generality and smaller memory requirements.
A complete description of this technique is given in [6]. The
partitioning procedure is as follows:


1. Transform the dependency graph to remove properties 
undesirable for an implementation, such as data broad- 
casting or bi-directional data flow. Procedures for these 
purposes have been presented in [8,9]. 


2. Transform the graph obtained in (1) into a new graph, 
which we call the G-graph, by collapsing groups of nodes 
into new nodes (G-nodes). The objective of this transfor- 
mation is to obtain a graph more suitable for partitioning, 
that is, with simple communication requirements. 


Criteria to perform the selection of primitive nodes com- 
posing a G-node are reported in [6]. 


3. Map G-nodes to a target array with m cells by scheduling
sets of m neighbor G-nodes (a G-set) for concurrent
computation. G-sets scheduled successively are executed
in overlapped (pipelined) manner in the array. The
selection of G-sets depends on the structure of the target
array. In addition, for maximal utilization, all nodes in a
G-set should have the same computation time.


The G-graph obtained with our procedure can be directly 
used to implement an array for a fixed-size problem. However, 
since the G-graph might be composed of nodes with different 
computation time, its direct implementation could lead to low 
utilization of cells. 


Partitioned Computation of Transitive Closure 


We present now the application of the proposed partitioning 
technique to the design of arrays to compute transitive closure 
of a directed graph. We first describe briefly the algorithm and 
then apply the three-step procedure indicated above. 


The transitive closure problem 


A directed graph G is a tuple G(V, E), where V is the set of
nodes and E is the set of edges in the graph. G can be described
by the adjacency matrix A, where element $a_{ij} = 1$ if there is
an edge from node i to node j or if i = j; otherwise $a_{ij} = 0$. A
directed graph $G^+(V, E^+)$ is called the transitive closure of G
if it has the same vertex set as G and has an edge from node
v to node w if and only if there is a path of length zero or
more from v to w in G. $G^+$ can be described by the adjacency
matrix $A^+$.


The computation of the transitive closure of a graph is usually
performed by Warshall's algorithm [11]. Given the adjacency
matrix A, $A^+$ is obtained through the application
of the following recurrence:


For k from 1 to n
  For i from 1 to n
    For j from 1 to n
      $x_{ij}^{k} = x_{ij}^{k-1} \oplus (x_{ik}^{k-1} \otimes x_{kj}^{k-1})$


In this expression, $X^0 = A$, $A^+ = X^n$, and the operators $\oplus$
and $\otimes$ stand for binary OR and binary AND, respectively.
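The recurrence can be sketched in a few lines of Python. This is an illustrative sequential version, not the array design discussed in the paper: the function name `transitive_closure` and the 0-based indices are assumptions, and the matrix is updated in place, which is safe for Warshall's algorithm because row k and column k do not change at level k.

```python
def transitive_closure(adjacency):
    """Warshall's algorithm on a 0/1 adjacency matrix (list of lists)."""
    n = len(adjacency)
    # X^0 = A, with the diagonal forced to 1 (paths of length zero).
    x = [[bool(adjacency[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # x_ij^k = x_ij^(k-1) OR (x_ik^(k-1) AND x_kj^(k-1))
                x[i][j] = x[i][j] or (x[i][k] and x[k][j])
    return [[int(v) for v in row] for row in x]


# Edges 0 -> 1 and 1 -> 2; node 3 is isolated.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
for row in transitive_closure(A):
    print(row)
```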


The fully-parallel dependency graph [8] of the transitive closure
algorithm is shown in Figure 1, for a problem of size n = 4
(i.e., to compute the transitive closure of a directed graph with
four nodes). The graph has four levels, where each level
corresponds to one iteration of the outermost index in the algorithm
above.


Some evaluations of the expression in the algorithm above
do not change the corresponding $x_{ij}$. In particular, the value
of a diagonal element in the adjacency matrix is always 1,
because a node in the directed graph is always adjacent to itself.
In addition, for k = i or k = j one of the two operands in
$x_{ik}^{k-1} \otimes x_{kj}^{k-1}$ becomes $x_{kk}^{k-1}$, which is a diagonal element and
is always equal to 1. Consequently, the result from the $\otimes$
operation is equal to the second operand, the $\oplus$ operator gets
two identical operands ($x_{kj}^{k-1}$ or $x_{ik}^{k-1}$), and the result is that
same operand. These properties can be utilized to simplify the
design of arrays to compute the transitive closure and to reduce
the complexity of the algorithm since fewer operations need to
be performed. Nodes surrounded by dashed areas in Figure 1
correspond to superfluous nodes (i.e., they do not need to be
computed).
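The effect of dropping these evaluations can be checked with a short sketch. It assumes that the superfluous evaluations are exactly those with i = k, j = k, or i = j, as the paragraph argues; whether this matches the dashed areas of Figure 1 node for node is not verified here. The function name `closure_counting` and its flag are hypothetical.

```python
def closure_counting(adjacency, skip_superfluous):
    """Warshall's recurrence, optionally skipping evaluations that cannot
    change the result. Returns (closure, number of evaluations)."""
    n = len(adjacency)
    x = [[1 if (adjacency[i][j] or i == j) else 0 for j in range(n)]
         for i in range(n)]
    evaluated = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Row k, column k, and the diagonal keep their values at level k.
                if skip_superfluous and (i == k or j == k or i == j):
                    continue
                x[i][j] = x[i][j] | (x[i][k] & x[k][j])
                evaluated += 1
    return x, evaluated


# Chain 0 -> 1 -> 2 -> 3 on four nodes (n = 4, as in Figure 1).
A = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
full, full_count = closure_counting(A, skip_superfluous=False)
reduced, reduced_count = closure_counting(A, skip_superfluous=True)
print(full == reduced, full_count, reduced_count)
```

For n = 4 the full recurrence performs n^3 = 64 evaluations, while the reduced one performs n(n-1)(n-2) = 24, with the same closure.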
Figure 1: Fully-parallel dependence graph of transitive closure


Arrays for partitioned computation of transitive 
closure 


We now apply our partitioning procedure to the computation of
the transitive closure of a directed graph. The fully-parallel
dependency graph shown in Figure 1 is not suitable for
implementation, because it exhibits broadcasting of data and complex
communication requirements. We address these issues first,
according to the procedure described previously.


There are two types of data broadcasting in Figure 1. At
the k-th level of the graph, data elements from row k of the
matrix are broadcast to all other rows. Moreover, the k-th
element of each row of the matrix is broadcast to all other
elements within each row. Because of the varying pattern of
broadcasting, other researchers have considered transitive
closure an irregular algorithm [13].


We transform the graph by replacing broadcasting with pipelining,
as suggested in [8]. Given that the k-th row of the adjacency
matrix remains unchanged at the k-th level of the graph,
we remove the nodes corresponding to updating those values
and draw the flow of data for that row horizontally,
intersecting the flow of data of the other rows of the matrix.
This modification is shown in Figure 2. Data evaluated
at each level of the graph flows vertically, while the data
element broadcast within each row of the matrix flows
diagonally through the row.


Because of the varying source of broadcasting, the transformed
graph in Figure 2 exhibits bi-directional flow of data.
However, this bi-directional flow can be eliminated by moving
nodes dependent on the broadcast data to one side of
the source of broadcasting. We have described this
transformation as an approach to solve this problem in dependence
graphs [9]. In this case, the transformation is applied in two
steps: first, nodes to the left of sources of horizontal
broadcasting are flipped to the right end of each row of the graph.
Then, nodes above sources of diagonal broadcasting are flipped
to the bottom end of such diagonals. In addition, delay nodes
are placed at the boundaries of the graph with the same
dependency structure that dominates the graph, as proposed in [9].
The resulting graph is shown in Figure 3.

Figure 3: Transformed transitive closure dependence graph


Once properties of the original dependency graph not suitable
for partitioning have been eliminated, we apply the
remainder of our partitioning procedure. We first transform the
graph into a G-graph by selecting sets of primitive nodes in
such a way as to reduce communication requirements and
obtain G-nodes with the same computation time. In this case,
diagonal paths are a good alternative for grouping, because
nodes in such paths communicate among themselves in a
repetitive manner and all paths have the same number of primitive
nodes. The result of performing such grouping is the G-graph
shown in Figure 4.


As an aside, Figure 4 is suitable for direct implementation as
an array for fixed-size problems. Such an array achieves
maximum utilization because all G-nodes have the same computation
time and the algorithm is computed in pipelined manner
in the array. Throughput is 1/n because the computation time
of G-nodes is n cycles. Successive instances of the algorithm
can be chained without restrictions. This array is simpler than
the one proposed in [13] because it has a single communication
path between cells and no control complexity. Furthermore,
data transfers and computations are overlapped, while the
array proposed in [13] requires that "data be first loaded in the
nodes and then reused for a period of n cycles" so that "certain
control is required in the systolic array."

Figure 5: Mapping G-graph onto a linear array


Another advantage of the array for fixed-size problems
described above is the simplicity of its derivation through
the graph-based methodology. This is in contrast with the
scheme in [13], which uses a rather complex mathematical
approach. Furthermore, the G-graph in Figure 4 can be collapsed
into a linear structure by grouping each horizontal path into a
single node. The resulting graph can be directly mapped onto
a linear array with throughput $[n(n + 1)]^{-1}$ and all cells fully
utilized.


Arrays to compute the transitive closure in partitioned mode
can be derived directly from the G-graph in Figure 4, as we
describe next.


Linear array 


Let us assume that we want to partition the computation of
the transitive closure of a directed graph with n nodes so that it
fits in a linear structure with m cells, where $m \ll n$. We
map G-sets from the transformed graph onto a linear array
by selecting G-sets of m G-nodes from horizontal paths, as
shown in Figure 5 for m = 4. Intermediate results from G-sets
are saved in external memories. Such data is available at the
boundary of the set, so that saving it in external memories is
straightforward.


Figure 6: Mapping G-graph onto two-dimensional array


The structure resulting from this approach enjoys maximal
utilization because all G-nodes executed concurrently have the
same computation time, except when executing boundary sets
in some horizontal paths that might not use all cells in the
array. The number of connections to external memories is m + 1.


Two-dimensional array 


Mapping the G-graph for execution in a two-dimensional
structure with m cells requires simulating a triangular array
and a square array, because those are the major components
of the G-graph. Both requirements can be fulfilled in a square
array. G-sets are selected as square blocks of $\sqrt{m}$ by $\sqrt{m}$
nodes, except for sets at the boundaries of the G-graph, which
are composed of triangular blocks of G-nodes. As in the linear
case, intermediate results are saved in external memories. The
structure resulting from this approach is shown in Figure 6 for
m = 4. Utilization of this array is maximal, except when
executing boundary sets because such sets do not use all cells in
the array. The number of connections to external memories is
$2\sqrt{m}$.


To use the arrays obtained above it is necessary to schedule
the execution of G-sets. Such scheduling is discussed in detail
in [6], where it is shown that linear and two-dimensional arrays
require the same I/O bandwidth from the host.


Conclusions 


We have proposed a technique to partition algorithms for
execution in arrays, based on dependency graphs of algorithms.
We described the application of this technique to the
computation of the transitive closure of a directed graph. Through this
example, we have shown that the approach is general and
powerful. This technique is suitable for a class of important
matrix algorithms, produces implementations with maximal
utilization of cells and no overhead due to partitioning, and
allows evaluating trade-offs between linear and two-dimensional
structures. Moreover, this graph-based approach is simpler to
use than schemes based on mathematical expressions.


We derived linear and two-dimensional arrays for partitioned
computation of transitive closure. In the process, we
have obtained a dependence graph which is suitable for
implementation of a fixed-size array for transitive closure, with
better characteristics than structures previously proposed for
this algorithm.


In [6], we describe other issues in partitioning algorithms, in
particular trade-offs between linear and two-dimensional
structures. We show there that, with the same number of cells,
linear arrays are simpler, have the same throughput and require
the same I/O bandwidth from the host as two-dimensional
ones, and might exhibit better utilization. Moreover, linear
arrays are more advantageous than two-dimensional ones
because they are better suited to incorporate fault-tolerant
capabilities. Consequently, we conclude that linear arrays offer
better performance and implementation than two-dimensional
arrays for partitioned execution of algorithms.
Abstract


Warp is a systolic computer developed by CMU and 
manufactured by GE. The machine has 10 or more 
linearly connected cells. Each cell in the array is capable 
of performing 10 million floating point operations per 
second (10 MFLOPS). The 10-cell array can achieve a 
peak performance of 100 MFLOPS. This paper describes 
parallel iterative sparse linear system solvers developed 
for the Warp systolic computer. For general sparse linear 
systems, Warp achieves 12.5 MFLOPS in sparse matrix 
vector multiplication, which competes with supercom- 
puters such as Cray-1S and Cyber-205. We implemented 
the general sparse linear system solver IC-PCCG 
(Incomplete Choleski Pre Conditioned Conjugate 
Gradient method) using the sparse matrix vector mul- 
tiplication kernel. The solver was exercised on sparse 
linear systems derived from production finite element ap- 
plications. Speedups of more than 100 over the VAX/780 
with floating point accelerator are achieved. For solving 
regular sparse linear systems, domain partitioning is used
to speed up solving finite difference equations on a regular
mesh. For a model problem of Laplace’s equation on a 
square mesh of 500 by 500 unknowns, Warp is able to 
achieve 14.6 MFLOPS using the generic SOR relaxation 
scheme and 49.4 MFLOPS using the 2-color SOR relaxa- 
tion scheme. 


1. Introduction 

Large sparse linear systems of order 10* to 10° frequently 
arise in large scale scientific and engineering analysis such as 
computational fluid dynamics, structural mechanics, 
electronic device simulation and electric magnet field 
analysis. Direct methods designed for solving dense linear 
systems such as LU and Choleski decomposition are imprac- 
tical for solving very large sparse systems, because very large 
storage is required. Driven by demands from applications, 
extensive efforts have been invested in the search for practical 
solvers for large sparse linear systems. There are two ap- 
proaches. One is to pick an appropriate direct method and 
adapt it to exploit the sparsity of linear systems. Typical 
adaptation strategies involve the intelligent use of data
structures and special pivoting strategies that minimize fill-in of the
coefficient matrix [5, 6]. In contrast to the direct methods are 
the iterative methods. These methods start with an initial 
guess to the solution and generate a sequence of successively 
improved solutions until it converges to the desired solution 
within the accepted tolerance. Iterative methods are much 
more efficient for very large sparse systems because the coef- 
ficient matrix is not decomposed and remains unchanged 
throughout iterations, therefore no fill-in is created. 


In this paper, several iterative sparse linear system solvers
on the CMU Warp machine are described. We first give a
brief review of the Warp machine and its architectural
strength in supporting sparse matrix computations. Secondly,
we consider the crucial kernels used in solving general sparse
linear systems: sparse matrix vector multiplication and sparse
triangular system solving. Implementations of these kernels on
Warp are described and compared with vector supercomputers
such as Cray-1S and Cyber-205. These kernels were
integrated into the general sparse linear system solver
IC-PCCG and exercised on sparse linear systems derived from
production finite element applications. Speedups of more than
100 over the VAX/780 with floating point accelerator are
achieved. Finally, we consider the problem of solving finite
difference equations on a square mesh. For the model
problem of Laplace's equation on a square mesh of 500 by
500 unknowns, Warp is able to achieve 14.6 MFLOPS using
the generic SOR relaxation scheme and 49.4 MFLOPS using
the 2-color SOR relaxation scheme.


2. The Warp Machine 

A brief overview of the Warp machine is given below (for
architectural details, programming tools, and its other
applications, see [1, 2, 3]). The Warp machine has three components:
the Warp processor array, or simply Warp array, the interface
unit, and the host, as depicted in Figure 2-1. The Warp
processor array performs the bulk of the computation. The
interface unit handles the input/output between the array and
the host. The host has two functions: carrying out high-level
application routines and supplying data to the Warp processor
array.


The Warp processor array is a programmable, linear sys- 
tolic array, in which all processing elements (Warp cells) are 
identical. Data flow through the array on two data paths (X
and Y), as shown in Figure 2-1. Each Warp cell contains
one floating-point multiplier, one floating point ALU and one 
integer ALU. The floating-point units can deliver up to 5 
MFLOPS each. This performance translates to a peak 
processing rate of 10 MFLOPS per cell or 100 MFLOPS for a 
10-cell processor array. A 32K-word memory is provided for 
resident and temporary data storage. The datapath of a Warp 
processor cell is shown in Figure 2-2. The host is a general 
purpose computer (currently a Sun workstation, with added 
MC68020 cluster processors for I/O and control of the Warp 
array). It is responsible for executing high-level application 
routines as well as coordinating all the peripherals. 


Figure 2-1: The Warp systolic computer
Figure 2-2: Warp cell datapath


A feature that distinguishes the Warp cell from many other 
processors of similar computation power is its high intercell 
communication bandwidth - an important characteristic for 
systolic arrays. Each Warp cell can transfer up to 20 million 
words (80 Mbytes) to and from its neighboring cells per 
second. We have been able to implement this high bandwidth 
communication link with only modest engineering efforts, 
because of the simplicity of the linear interconnection struc- 
ture and clocked synchronous communication between cells. 
This high inter-cell communication bandwidth makes it
possible to transfer large volumes of intermediate data between
neighboring cells and thus supports fine grain problem
decomposition. For communicating with the outside world, 
the Warp array can sustain a 20 Mwords/sec peak transfer 
rate. In the current setup, the host can only support up to 2 
Mwords/sec transfer rates. 


3. General Sparse Linear Systems 

The compact matrix storage structure makes sparse matrix
computations different from those for dense matrices. Figure
3-1 shows a widely used storage structure for general sparse
matrices. The A array stores the non-zero elements of the
matrix, the JA array stores the column index of each non-zero
element, and the IA array is an index array which points to the
starting element of each row in the A and JA arrays. This
compact matrix format is used in general sparse matrix
packages such as ITPACK [8].
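As a small illustration of this storage structure, the sketch below builds the three arrays from a dense matrix. The function name `to_compressed_rows` and the 0-based indexing are assumptions made for the example. The row-pointer array is the one the loops in the following sections index as IA[i] and IA[i+1], and it carries one extra trailing entry so that the end of the last row is known.

```python
def to_compressed_rows(dense):
    """Build (A, JA, IA) for a dense matrix given as a list of rows.

    A  : non-zero values, stored row by row
    JA : column index of each stored value
    IA : IA[i] is the position in A/JA where row i starts; IA[n] == len(A)
    """
    a, ja, ia = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                a.append(value)
                ja.append(col)
        ia.append(len(a))
    return a, ja, ia


dense = [[4, 0, 1],
         [0, 3, 0],
         [2, 0, 5]]
A, JA, IA = to_compressed_rows(dense)
print(A)   # [4, 1, 3, 2, 5]
print(JA)  # [0, 2, 1, 0, 2]
print(IA)  # [0, 2, 3, 5]
```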


We identify two sparse matrix kernels which dominate the 
computation of the iterative solvers under our consideration. 
They are 


• sparse matrix vector multiplication;


• sparse triangular system solving.

Figure 3-1: Sparse matrix storage

3.1. Sparse Matrix Vector Multiplication 

Consider the algorithm for multiplying a sparse matrix with
a dense vector, y = Ax:


for i := 0 to n-1 do begin
    jbgn := IA[i]; jend := IA[i+1] - 1;
    sum := 0.0;
    for j := jbgn to jend do begin
        sum := sum + A[j] * x[JA[j]];
    end;
    y[i] := sum;
end


The algorithm steps through the sparse matrix row by row and
does an inner product of the sparse row vector with the dense
vector x. The inner product computation is optimized by
collecting elements from the dense vector indexed by the
sparse vector to avoid multiplication and addition with zeros.
This process of randomly collecting elements from a dense
vector to match a sparse vector is known as the gather
operation. The sparse matrix vector multiplication algorithm is an
example where the innermost loop is sequential while the
outer loop is completely parallel. To parallelize the
computation, we simply distribute the rows of the matrix to the 10-cell
array by interleaving, that is, cell i has rows 10k+i, for
$0 \le k \le \lfloor n/10 \rfloor$. The dense vector is duplicated on all the cells.
Because of the 7-stage pipelined floating point adder, a Warp
cell can only do the sparse dot product step at the rate of one
every 8 cycles, that is, 1.25 MFLOPS out of its 10 MFLOPS
peak performance. The 10-cell Warp array can achieve 12.5
MFLOPS in sparse matrix vector multiplication. This
performance figure is as good as or better than supercomputers such
as Cray-1S (Gather: 5 to 11 MFLOPS, Peak 210 MFLOPS)
and Cyber-205 (Gather: 4 to 17 MFLOPS, Peak 800
MFLOPS) [4]. Note that Warp is a single precision machine
while the performance figures for Cray-1S and Cyber-205 are
double precision results. The relatively poor performance of
vector computers is due to their heavily pipelined memory
systems and vector oriented processor units, which do not
perform well in random indirect addressing and short vector
computations.
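The interleaved row distribution can be sketched as follows. This is a sequential simulation written for illustration, not Warp code: the outer loop stands in for the cells, which on the real machine would run concurrently, and the names `rows_of_cell` and `spmv_interleaved` are hypothetical. Every cell reads the same copy of x, mirroring the duplication of the dense vector.

```python
def rows_of_cell(cell, n, cells=10):
    """Rows assigned to one cell under interleaving: cell, cell + cells, ..."""
    return list(range(cell, n, cells))


def spmv_interleaved(a, ja, ia, x, cells=10):
    """y = A x with rows dealt out to `cells` workers by interleaving.

    a, ja, ia follow the compact row format (values, column indices,
    row pointers with a trailing end marker), using 0-based indices.
    """
    n = len(ia) - 1
    y = [0.0] * n
    for cell in range(cells):              # each pass plays the role of one cell
        for i in rows_of_cell(cell, n, cells):
            total = 0.0
            for j in range(ia[i], ia[i + 1]):
                total += a[j] * x[ja[j]]   # gather x[ja[j]], then multiply-add
            y[i] = total
    return y


# The 3 x 3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in compact form.
a, ja, ia = [4.0, 1.0, 3.0, 2.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5]
print(spmv_interleaved(a, ja, ia, [1.0, 2.0, 3.0], cells=2))
```

Because no row depends on another, the result does not depend on the number of cells or the order in which they are visited.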


3.2. Sparse Triangular System Solving 

Sparse triangular system solving is an inherently sequential
process. Consider the algorithm for solving a sparse lower
triangular system Ax = y:


for i := 0 to n-1 do begin
  j := IA[i];
  sum := y[i];
  while (JA[j] < i) do begin
    sum := sum - A[j] * x[JA[j]];
    j := j+1;
  end
  x[i] := sum;
end


and similarly the algorithm for solving sparse upper triangular 
system Ax=y: 


for i := n-1 downto 0 do begin
  j := IA[i+1]-1;
  sum := y[i];
  while (JA[j] > i) do begin
    sum := sum - A[j] * x[JA[j]];
    j := j-1;
  end
  x[i] := sum;
end
L1, L2, L3 are general sparse matrices.


Figure 3-2: p-color ordered sparse triangular system, p=4 
The innermost loops of these algorithms are gather operations
and their outer loops are strictly sequential, which is not the
case in matrix vector multiplication. The technique of
multi-color reordering suggested in [11,10] is used to restructure
the sparse matrix and parallelize sparse triangular system
solving. As shown in Figure 3-2, a p-color sparse triangular
matrix has p identity blocks along the diagonal. Note that they
are blocks of identity matrices, not blocks of identical size.
A triangular system with such a structure can be solved in
p-1 steps instead of n-1 steps, where n is the order of the
system. Each step of the solution is a sparse matrix vector
multiplication; for example, Figure 3-3 shows the 3 steps for
solving a 4-color triangular system. Since sparse matrix vector
multiplication can be done in parallel, sparse triangular
system solving is thus parallelized.
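The following Python sketch shows one way the p-1 steps could be organized. It is not the Warp code: we assume a unit diagonal, store only the strictly lower part in IA, JA and A, and introduce a hypothetical array `color_ptr` that marks where each color block begins.

```python
def multicolor_forward_solve(IA, JA, A, y, color_ptr):
    """Solve L x = y for a p-color ordered unit lower triangular L.

    IA, JA, A hold the strictly lower part of L row by row (assumption).
    Rows color_ptr[c] .. color_ptr[c+1]-1 form color block c; a row only
    references unknowns from earlier blocks.
    """
    x = list(y)                       # block 0: identity block, x = y
    p = len(color_ptr) - 1
    for c in range(1, p):             # p-1 steps
        lo, hi = color_ptr[c], color_ptr[c + 1]
        for i in range(lo, hi):       # rows of one block are independent
            s = y[i]
            for j in range(IA[i], IA[i + 1]):
                assert JA[j] < lo, "coupling inside a color block"
                s -= A[j] * x[JA[j]]
            x[i] = s
    return x


# Hypothetical 2-color system of order 4:
# L = [[1,0,0,0],[0,1,0,0],[1,2,1,0],[0,3,0,1]]
IA = [0, 0, 0, 2, 3]
JA = [0, 1, 1]
A = [1.0, 2.0, 3.0]
x = multicolor_forward_solve(IA, JA, A, [1.0, 1.0, 1.0, 1.0], [0, 2, 4])
```

Within one step every row of the block can be handed to a different cell, exactly as in sparse matrix vector multiplication.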


Figure 3-3: Forward solving: a 4-color ordered system


3.3. The IC-PCCG Solver 

The Conjugate Gradient (CG) method was developed by
Hestenes and Stiefel in 1952 [7] and was subsequently widely
used for solving minimization problems. Only since the 1970s
has the CG method been used for solving linear systems of
equations with the symmetric positive definite (SPD)
property. Its success is connected with the development of
the Preconditioned CG (PCCG) iteration. The Incomplete
Choleski preconditioner (IC-PCCG) [9] is one of the most
successful general purpose preconditioning strategies used
in practice.


Let A be a symmetric positive definite (SPD) n by n sparse
matrix. We want to solve the linear system $Ax = b$. For an
approximation $\tilde{x}$ we define the error functional as

$$F(\tilde{x}) = \tfrac{1}{2}(x-\tilde{x})^T A\,(x-\tilde{x}) = \tfrac{1}{2}\, r^T A^{-1} r$$

where $x-\tilde{x}$ is the error vector and the residual vector $r$ is
defined as $r = b - A\tilde{x}$. This functional is minimized by the
exact solution of $Ax = b$. The CG method prescribes how to
choose a sequence of approximations $x_k$ such that the
functional $F(x_k)$ is minimized in an optimal way. In the steepest
descent method the new approximation $x_{k+1}$ is found in the
direction of the gradient, which is the residual vector $r_k$.


The CG method converges in at most n iterations in the
absence of round-off errors, and its convergence rate is strongly
determined by the clustering of the eigenvalues of the matrix A.
The generic CG method is not practical for applications
because of its slow convergence rate. The PCCG method
is thus introduced, which instead of solving the system
$Ax = b$, solves the preconditioned system

$$(M^{-1}A)\,x = M^{-1}b$$

where M is the preconditioning matrix with the properties that
• M is positive definite.
• $M^{-1}A$ has better spectral properties than those of
A, that is, a smaller spectral radius and more
clustered eigenvalues.
• It is relatively cheap to solve a system with M,
$Mz = r$.
Although the detailed theory of PCCG may be complicated, it
turns out that the approximation $x_k$ can be computed simply
by the following iterative algorithm with $\epsilon$ as the
stopping criterion.


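Initialization:
    $x_0 = 0$
    $r_0 = b$
    Solve: $M z_0 = r_0$
    $p_0 = z_0$
    $k = 0$

Iteration:
    While $\sqrt{r_k \cdot r_k} > \epsilon$ do
        $\alpha_k = (r_k \cdot z_k) / (p_k \cdot A p_k)$
        $x_{k+1} = x_k + \alpha_k p_k$
        $r_{k+1} = r_k - \alpha_k A p_k$
        Solve: $M z_{k+1} = r_{k+1}$
        $\beta_k = (r_{k+1} \cdot z_{k+1}) / (r_k \cdot z_k)$
        $p_{k+1} = z_{k+1} + \beta_k p_k$
        $k = k+1$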


For the IC-PCCG algorithm, the preconditioning matrix
$M = LDL^T$ is derived from the incomplete Choleski
decomposition of A, where D is a diagonal matrix and L is a lower
triangular matrix with the same sparsity pattern as A. A detailed
discussion of the IC-PCCG algorithm can be found in [9]. In
addition to the vector additions ($a + sb$) and inner products
($a \cdot b$), a sparse matrix vector multiplication and two sparse
triangular system solves are the kernels used inside an IC-PCCG
iteration.
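To make the structure of one iteration concrete, here is a small Python sketch of the preconditioned CG loop. It is an illustration under stated assumptions, not the Warp implementation: A is stored densely for brevity, and the preconditioner solve is passed in as a hypothetical callback `solve_M`. The example uses a simple diagonal (Jacobi) preconditioner as a stand-in for the incomplete Choleski factors.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def pccg(A, b, solve_M, eps=1e-10, max_iter=1000):
    """Preconditioned CG for SPD A; solve_M(r) returns z with M z = r."""
    x = [0.0] * len(b)
    r = list(b)
    z = solve_M(r)
    p = list(z)
    rz = dot(r, z)
    k = 0
    while math.sqrt(dot(r, r)) > eps and k < max_iter:
        Ap = matvec(A, p)                       # matrix vector kernel
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        z = solve_M(r)                          # preconditioner solve
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
        k += 1
    return x, k


# Hypothetical 2 by 2 SPD system with a Jacobi stand-in for M.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def jacobi(r):
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, iters = pccg(A, b, jacobi)
```

With $M = LDL^T$, `solve_M` would instead perform the forward and backward triangular solves described in Section 3.2.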
In our implementation of the IC-PCCG algorithm, matrix A
is first multi-color renumbered and then M is derived from the
renumbered matrix by incomplete Choleski decomposition.
Matrices A and M are distributed to the 10 processor cells by
row interleaving. Vectors p, r, z and x are also distributed
to the 10 processor cells by interleaving. Scalars are
duplicated in all cells. One working vector of length n is allocated
in each cell for sparse matrix vector multiplication and
triangular system solving. The p vector is copied to the
working vector before Ap is computed. A segment of the solution
vector is generated after a parallel forward (backward)
substitution step completes. The segment is then copied to the
working vector before the next parallel forward (backward)
substitution step starts. The inner products are done by
computing the partial result in each cell. All the partial results are
summed together from the first cell to the last cell and then
broadcast backwards from the last cell to all the cells. Vector
additions are naturally parallelized because the vectors are
distributed. Scalar computations and the convergence test are
performed by all cells, that is, sequential computations are
duplicated. The host is not involved in the iterative process at all,
thus the limited host I/O bandwidth does not affect the
computation.
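The inner product scheme can be mimicked in a few lines of Python. This is only a sequential model of the data movement, with names of our own choosing (`distributed_dot`, `cells`); on the array the partial sums would travel along the systolic pathway.

```python
def distributed_dot(a, b, cells=10):
    """Model of the interleaved inner product: returns each cell's copy."""
    n = len(a)
    # Local step: cell c owns elements c, c+cells, c+2*cells, ...
    partial = [sum(a[i] * b[i] for i in range(c, n, cells))
               for c in range(cells)]
    # Forward pass: a running sum flows from the first cell to the last.
    running = 0.0
    for c in range(cells):
        running += partial[c]
    # Backward pass: the total is handed back from the last cell to all.
    result_in_cell = [None] * cells
    for c in reversed(range(cells)):
        result_in_cell[c] = running
    return result_in_cell


a = [float(i) for i in range(1, 25)]   # 1 + 2 + ... + 24 = 300
b = [1.0] * 24
copies = distributed_dot(a, b)
```

After the backward pass every cell holds the same scalar, so the scalar updates and the convergence test can be duplicated without further communication.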
Figure 3-4: Copy a distributed vector to all the cells


Copying a distributed vector to all the cells is the major
communication overhead in this mapping. With the support of
the systolic communication pathway, we are able to reduce
this overhead significantly. Figure 3-4 shows a method to
achieve fast communication. The method can copy a
distributed vector of length n to all the cells in n+c-2 cycles,
where c is the number of cells. Limited by the local memory
bandwidth, this result is only c-2 cycles away from the best
achievable result of n cycles. We have exercised our
IC-PCCG implementation on sparse matrices arising in
finite element analysis applications from GE Corporate
Research and Development (GE-CRD). Limited by the small
cell memory, these matrices have no more than 4000 unknowns.
The performance of the GE production IC-PCCG code
run on a VAX-780 (with floating point accelerator) under VMS
is compared with Warp. Warp is more than 100 times
faster, depending on the sparsity of the matrix.


4. Regular Sparse Linear Systems 

In this section, we describe the mapping methods for solv- 
ing finite difference equations on a regular mesh, or equiv- 
alently a sparse linear system with multiple nonzero diagonals 
in the coefficient matrix. Problems of this type frequently 
arise in numerical solution of partial differential equations by 
finite difference approximation. The regular structure of the 
sparse matrix makes index vectors obsolete and more effec- 
tive mapping schemes can be used to improve the computa- 
tion speed. 
Figure 4-1: Discretized square domain (5-point stencil)

For simplicity and ease of presentation, we illustrate the
example of solving Laplace's equation on a square domain
with a Dirichlet boundary condition. The domain is
discretized using the five point difference scheme, as shown
in Figure 4-1. Linear equations of the form

$$4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} = 0$$

are obtained for the interior grid points. The solution $u_{i,j}$
can be derived by the method of Successive Over Relaxation
(SOR) [13]. Of course, the problem can be solved using other
fast methods [12], but it makes a convenient example with
which to illustrate a mapping scheme for general
problems of this type. The SOR iteration can be formulated
by the recurrence equation:

$$u_{i,j}^{k} = (1-\omega)\,u_{i,j}^{k-1} + \frac{\omega}{4}\left(u_{i-1,j}^{k} + u_{i,j-1}^{k} + u_{i+1,j}^{k-1} + u_{i,j+1}^{k-1}\right)$$

where the superscript denotes the iteration number and $\omega$ is
the relaxation parameter. The generic SOR iteration is
considered sequential on vector computers. It cannot be
vectorized in either the column or the row dimension because of
the recurrence in the algorithm. For Warp, even the
nested recurrence computation can naturally be parallelized
using the systolic pathway.
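A sequential reference version of one relaxation sweep helps to show where the recurrence comes from. The Python sketch below assumes the usual SOR update for the five point Laplace stencil and uses the largest pointwise change as the convergence measure; the grid size, the value of omega and the helper name `sor_sweep` are our own choices.

```python
def sor_sweep(u, omega):
    """One in-place SOR sweep over the interior of the square grid u.

    Boundary rows and columns hold the Dirichlet values. Returns the
    largest change made to any unknown during the sweep.
    """
    n = len(u)
    change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # u[i-1][j] and u[i][j-1] were already updated in this sweep:
            # this is the recurrence in both the row and column dimension.
            avg = 0.25 * (u[i - 1][j] + u[i][j - 1]
                          + u[i + 1][j] + u[i][j + 1])
            new = (1.0 - omega) * u[i][j] + omega * avg
            change = max(change, abs(new - u[i][j]))
            u[i][j] = new
    return change


# Hypothetical 6 by 6 grid: boundary fixed at 1, interior started at 0.
u = [[1.0] * 6 for _ in range(6)]
for i in range(1, 5):
    for j in range(1, 5):
        u[i][j] = 0.0
sweeps = 0
while sor_sweep(u, 1.2) > 1e-12 and sweeps < 10000:
    sweeps += 1
```

With a constant boundary the exact solution is constant, so the interior converges to 1.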


Our mapping is based on a simple domain partition; no
preprocessing is needed to achieve the splitting. Consider the
mapping on a linear array of two processor cells. The mesh
of unknowns is evenly partitioned between the 2 cells along the
column dimension with one column overlapped, as shown in Figure
4-2. In one relaxation step, the computations of the two
cells are scheduled as follows. When cell 0 completes the
computation for the first half of row i, it sends the overlapped
element to cell 1 and continues to compute row i+1 in its
domain. In the meantime, cell 1 receives the value generated
by cell 0 and continues the computation on the second half of
row i. Cell 1 sends its overlapped element back to cell 0
when it is generated. The datum is stored in the queue and
will not be retrieved by cell 0 until it finishes the computation
of the first half of row i+1. This process repeats until the last
row. Figure 4-3 illustrates the communication between the two
cells.
Figure 4-2: Domain partition


In this example, the queue between processors is used both
for communication and synchronization, a unique feature of
systolic arrays. With the combined communication and
synchronization scheme, the mapping is free of overhead.
The zero cost of synchronization remains as the number of
processors increases and the granularity between
synchronizations decreases. This property cannot be achieved in
many shared memory multiprocessors, where the
synchronization is done sequentially. A complete SOR
algorithm needs to compute the norm of the error vector between
successive iterations to determine the convergence of the
solution. The norm of the error vector is computed locally inside
each cell. After one relaxation step is completed, the partial
norms of all cells are combined from the first cell to
the last cell and then broadcast backward from the last cell to all
the cells. The convergence test is done by all the cells. As in
the IC-PCCG algorithm, the host is not involved in the
iterative process. For a mesh of 500 by 500, the Warp computer
can finish one SOR iteration with convergence test in 187 ms,
which is 754 ns per point per iteration or 14.6 MFLOPS. The
poor absolute performance is caused by the 7-stage pipelined
processing unit inside each cell. A simple fix to avoid the cell
pipeline problem is to use the 2-color relaxation scheme. In
the first half of an iteration, we update the unknowns $u_{i,j}$
where i+j is even. In the second half of an iteration we update
$u_{i,j}$ where i+j is odd. Each half iteration is completely parallel,
thus the cell pipeline unit can be utilized more effectively.
One complete 2-color SOR relaxation for the 500 by 500
mesh can be done in 54 ms, which is 224 ns per point per
iteration or 49.4 MFLOPS.
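The 2-color scheme can be sketched in Python as follows. The sketch assumes the standard SOR update for the five point Laplace stencil; the function name, grid size and omega are illustrative assumptions. Within each half sweep no updated point reads another point of the same color, which is what makes the half iteration fully parallel.

```python
def two_color_sweep(u, omega):
    """One 2-color SOR relaxation: even points (i+j even), then odd."""
    n = len(u)
    for parity in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == parity:
                    # All four neighbors have the opposite color, so the
                    # updates of one color are independent of each other.
                    avg = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                    u[i][j] = (1.0 - omega) * u[i][j] + omega * avg
    return u


# Hypothetical 6 by 6 grid: boundary fixed at 1, interior started at 0.
grid = [[1.0] * 6 for _ in range(6)]
for i in range(1, 5):
    for j in range(1, 5):
        grid[i][j] = 0.0
for _ in range(200):
    two_color_sweep(grid, 1.2)
```

Because the points of one color are independent, successive updates can be issued back to back without waiting for the pipeline to drain.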
Figure 4-3: Parallel SOR relaxation on a 2-cell array


5. Conclusions 

We have demonstrated that a linear systolic array of
powerful processors like Warp can be used effectively in solving
sparse linear systems. The high bandwidth systolic intercell
pathway is very powerful for fast communication and
synchronization. It is used to reduce the communication
overhead in the IC-PCCG algorithm and to parallelize the
nested recurrence computation in the generic SOR relaxation.
The MIMD array is useful because multiple gather operations
in the sparse matrix vector multiplication cannot be
sequenced by a single instruction stream across the array. It is
the heavily pipelined processor cell, not the linear array,
that limits the achieved performance. The cell's single precision
floating point arithmetic and the small local memory capacity
also limit the Warp computer's use for large scale sparse
matrix applications.
Abstract: A general methodology to map the computations
of two dimensional systolic arrays onto one dimensional
arrays is developed. Since two dimensional arrays have been
developed for a large class of problems, using our technique they
can be translated into one dimensional arrays with bounded I/O
bandwidth requirement. As applications of our methodology we
show a) improved linear systolic arrays for several matrix
oriented computations such as matrix multiplication, transitive
closure and dynamic programming, b) systolic arrays with tradeoff
between number of PEs, local storage and I/O bandwidth and
c) fault tolerant systolic designs which can be implemented in
Wafer Scale Integration. Compared to known designs in the
literature our methodology leads to modular systolic arrays with
constant hardware in each PE, few control lines, lexicographic
data input/output format and improved delay time.


1. Introduction 


VLSI arrays have been designed to implement cost effective and
efficient parallel solutions in hardware. Using this
methodology, parallel solutions to a large class of numerical, signal and
image processing problems have been implemented in hardware
[7][15]. Most of these designs consist of two dimensional arrays
of PEs which solve problems involving $O(n^3)$ computations in
$O(n)$ time using $O(n^2)$ PEs. In general, such two dimensional
arrays have $O(n)$ I/O bandwidth and hence data can be easily
aligned for the desired operations to be performed as the data
flows through the array.


Recently, the design of parallel algorithms for linear arrays has
become increasingly important [23][12][3]. Linear arrays offer
several advantages compared to two dimensional arrays. They
require constant I/O bandwidth. As the problem size becomes
larger, linear arrays become attractive to implement because
only a fixed number of I/O pins are needed on the chip. Also, in
Wafer Scale Integration (WSI), it has been shown [8] that a
unidirectional linear array structure leads to 100% utilization of good
PEs. No technique is known to result in high PE utilization on
a wafer in the case of two dimensional arrays (in the worst case) [29]
[4]. However, due to the limited I/O access in one
dimensional arrays, the mapping techniques in the literature either cannot
be directly utilized to design such arrays or result in complex
systolic designs. Some of the known linear systolic array design
methodologies lead to bidirectional data flow, which is not
desirable in wafer scale integration [29] [13]. Other methodologies
lead to complicated control and nonuniform I/O, which makes it
difficult to interface with the host [23] [24] [28] [29].


In this paper, we develop a general mapping technique to
map the computations of two dimensional arrays onto one
dimensional arrays. Using our method, "clean" linear systolic
arrays can be designed for a general class of problems (including
Matrix Multiplication, Transitive Closure, Dynamic
Programming etc.) for which two dimensional arrays have been designed
in the past. The resulting linear arrays have a continuous I/O
sequence and the modular extensibility property. Using our
methodology, a family of linear systolic arrays for matrix multiplication
and related problems in signal and image processing can be
designed, exhibiting tradeoffs between I/O bandwidth, local
storage, processor complexity and number of PEs. In addition, our
technique also leads to designs with unidirectional flow of data
and control, which makes our designs easily implementable in
well known reconfiguration schemes proposed for WSI.

The rest of the paper is organized as follows. In section 2, 
we present our technique to map algorithms onto linear systolic 
arrays which results in simple designs for many problems. In 
section 3, we apply our mapping methodology to design 1) lin- 
ear systolic arrays for some matrix problems, 2) family of linear 
arrays, and 3) fault tolerant linear arrays. Finally, some com- 
parisons and conclusions are made. 


2. Mapping from 2-D to 1-D arrays 


In this section, we discuss the basic idea of our mapping tech- 
nique and its limitations. Timing analysis and details of imple- 
mentation are discussed in the following sections. 


Our technique starts with a two dimensional systolic array.
For clarity of presentation, we will consider a two
dimensional array to compute C = A × B (where A, B and C are
matrices) as an example. A 4×4 array is shown in figure 1, where
the PEs are numbered in row major order. The input data in
each row (and each column) flow through the row (and column)
of the array. For example, $a_{11}$ passes through $PE_1$, $PE_2$, $PE_3$,
$PE_4$ and $b_{11}$ passes through $PE_1$, $PE_5$, $PE_9$, $PE_{13}$. All the
computations to compute $C_{ij}$ are performed in $PE_{(i-1)*n+j}$. As
a data item passes through a PE, the PE performs a computation with
the data arriving at its other input and updates $C_{ij}$. For
example, in figure 1, if the computations begin at t=1, then $a_{23}$ will
meet $b_{34}$ at time t=7 in $PE_8$ to calculate $C_{24}$.
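The numbering and the meeting time in this example can be checked with a short Python sketch. The input skew used here is an assumption on our part (the usual one for such arrays): $a_{ik}$ enters row i at time i+k-1, $b_{kj}$ enters column j at time k+j-1, and every item advances one PE per clock. The function names are hypothetical.

```python
def pe_index(i, j, n):
    """Row-major number (1-based) of the PE that accumulates C[i][j]."""
    return (i - 1) * n + j


def a_arrival(i, k, j):
    """Time at which a[i][k] reaches column j of row i (assumed skew)."""
    return (i + k - 1) + (j - 1)


def b_arrival(k, j, i):
    """Time at which b[k][j] reaches row i of column j (assumed skew)."""
    return (k + j - 1) + (i - 1)


n = 4
pe = pe_index(2, 4, n)       # PE that computes C[2][4]
t_a = a_arrival(2, 3, 4)     # a[2][3] arriving at that PE
t_b = b_arrival(3, 4, 2)     # b[3][4] arriving at that PE
```

Under this skew both operands reach the PE in the same clock period, which agrees with the values quoted in the text.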


One way of mapping the above two dimensional array into 
a linear array is by partitioning each row and stretching it with 
their links in row major order as shown in figure 2. Thus, the 
resulting linear array will have $n^2$ PEs. The PEs in a row in
the 2-D array correspond to a block (of n PEs) in the 1-D
array. However, the array in figure 2 is not the desired linear
array. The data has to be fed to internal PEs in the array, 
which is not allowed in a one dimensional array. The desired 
structure is shown in figure 3. It features local connections with 
I/O performed at the leftmost and rightmost PE. The input 
matrices A and B are partitioned into A and B bands as shown 
in figure 1. The input data is fed at the leftmost PE of the array 
as shown in figure 3. 


In order to make the linear array in figure 3 simulate the
1-D array in figure 2, we have to address the following problems.


1. Activation of PEs 

This problem is concerned with activating the PEs to
perform operations when the desired operands arrive at a
PE. In the above example, in order to simulate the
operations of the 2-D array with the linear array, we use
control signals to let $a_{11}$ be 'activated' from $PE_1$ to $PE_4$
and be 'deactivated' in the other PEs. When $a_{11}$ is
activated in a PE, it performs a computation (of the form
$C_{1j} \leftarrow C_{1j} + a_{11} * b_{1j}$) with $b_{1j}$ in $PE_j$, $1 \le j \le 4$.
Similarly, $b_{11}$ is activated only in $PE_1$, $PE_5$, $PE_9$, $PE_{13}$ to
compute with $a_{i1}$, $1 \le i \le 4$. In other words, in the
linear array, $a_{11}$ is transported from $PE_5$ to $PE_{16}$ without
doing any operation and $b_{11}$ is transported through $PE_2$,
$PE_3$, $PE_4$, $PE_6$, $PE_7$, $PE_8$, $PE_{10}$, $PE_{11}$, $PE_{12}$, $PE_{14}$,
$PE_{15}$, $PE_{16}$ while being deactivated. In general,
the element $a_{ij}$ is to be activated at $PE_{(i-1)*n+m}$
where $1 \le m \le n$, and $b_{ij}$ is to be activated at $PE_{(k-1)*n+j}$
with $1 \le k \le n$.
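The activation rule can be tabulated with a few lines of Python. The function names are hypothetical; the index formulas are the ones stated above, with PEs numbered from 1.

```python
def active_pes_for_a(i, n):
    """PEs in which an element of row i of A is activated (block i)."""
    return [(i - 1) * n + m for m in range(1, n + 1)]


def active_pes_for_b(j, n):
    """PEs in which an element of column j of B is activated."""
    return [(k - 1) * n + j for k in range(1, n + 1)]


n = 4
a11_active = active_pes_for_a(1, n)
b11_active = active_pes_for_b(1, n)
b11_inactive = [p for p in range(1, n * n + 1) if p not in b11_active]
```

For n = 4 this gives the PE lists quoted in the example: an A element is active in one contiguous block, while a B element is active in one PE of every block.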


2. Operand alignment 
This problem is concerned with ensuring that the right operands
meet in a PE to perform an operation. In order to
satisfy the alignment of operands in the linear array, we use
two types of channels: fast and slow channels. Suppose
the data in the fast channel takes $\alpha$ time units to pass
through a PE while the data in the slow channel takes $\gamma$
time units, where $\alpha < \gamma$. If the elements of the A (B) matrix
are fed into a slow (fast) channel in column (row) major
order, then, for $\gamma = 2$ and $\alpha = 1$, if $a_{ik}$ and $b_{kj}$ reach
$PE_p$ at time $t_0$, then $a_{ik}$ will meet $b_{k,j+1}$ at time $t_0 + 1$ at
$PE_{p+1}$.


3. Transportation of data from row to row 

This problem is concerned with simulating the movement
of data from one row to another row in the 2-D array
(those data crossing the dashed lines in figure 1) on the
one dimensional array. We use an extra channel (in the
above example, BS) to transport this data within each
block; the data to be used by the PEs in the next block
is stored in this channel.


In summary, a general design methodology is as follows: 


1. Start with a 2-D systolic array with data flow along the
positive coordinate directions. Without loss of generality
let the array size be m × n without diagonal connections.


2. Partition the array and compress it into a linear array. A 
general partition rule will be given in section 3.3. 


3. Assume there are x channels along the X axis and y
channels along the Y axis connecting adjacent PEs in the 2-D
array. Each horizontal channel in the 2-D array
corresponds to one slow channel connecting adjacent PEs in
the resulting 1-D array and each vertical channel
corresponds to a slow and a fast channel. Thus, we have x + 2y
channels connecting PEs in the 1-D array.


4. Feed the A and B bands into the leftmost PE. 
5. Design a scheme to solve 


(a) The activation of operations (using control signals). 


(b) The alignment of operands (using fast and slow chan- 
nels). 


(c) The transportation of operands (using transporta- 
tion channel and mechanism to switch data chan- 
nels). 


6. The delays in each PE can be determined by the following 
procedure: 


(a) Let the amount of delay within each PE of each 
channel be a parameter. 


(b) Using the design scheme in step 5 above, obtain
timing equations of channels involving these
parameters.


(c) Using the alignment and activation requirements, 
obtain constraint equations to assure that the de- 
sired data meet in the active PEs. 


(d) Using the timing equations and constraint equations,
choose the optimal set of parameters to minimize
delay.
An important requirement for this methodology to be
applicable is that the data flow in the original two dimensional array is
unidirectional along the coordinate axes. The following proposition
states that our methodology can be applied to arbitrary two
dimensional arrays [9].


Proposition 1 If a computation can be performed in $O(n)$ time
on a two dimensional n × n systolic array, then it can be
transformed such that the resulting array has $O(n^2)$ PEs and the data
flow is unidirectional along the X and Y axes with no
asymptotic loss in time. This array can be further transformed into a
unidirectional linear array using the proposed mapping
technique.


3. Applications 


In this section, we illustrate our mapping technique by designing
linear systolic arrays for several applications.


3.1 A linear array for Matrix Multiplication 


Consider the 2-D array for matrix multiplication as shown in
figure 1. The linear systolic array shown in figure 3 consists
of $n^2$ PEs numbered $1, \ldots, n^2$ from left to right. The PEs are
connected by three data channels which carry the input data,
i.e. a fast channel BF for elements of B, and two slow channels
AS and BS for the elements of A and B respectively. One-bit
wide control lines ACT, I, J connect adjacent PEs. All the
control signals and data move from left to right only. We will
use the above connections to solve the following problems (step
5 in our method).


Activation of PEs 


When the AS and BF channels carry data that must be combined in
a PE, the PE must be activated to perform a computation
of the form $C_{ij} \leftarrow C_{ij} + a_{ik} * b_{kj}$. We implement this by
inputting a control signal denoted ACT at the left end of the
array to set a flag ACTIVE inside each PE. A PE is said to be
active if it has ACTIVE set to 1. It will then perform a partial
product computation during that clock period. In our design,
when ACTIVE=1, $a_{ik}$ in the AS channel is multiplied with $b_{kj}$ in
the BF channel.


Alignment of Operands 


To get the correct operands together to perform an operation
in each PE, data channels with different speeds are used to align
the operands. There are two types of alignment in our matrix
multiplication design. The first type concerns the alignment
of operands within a row. The second type concerns the
alignment of operands from block to block. For example, an
activated $a_{ik}$ (in the $i$-th block) has finished all its operations
with the $b_{kj}$'s ($1 \le j \le n$) when it reaches $PE_{i*n}$; $a_{i+1,k}$ (which
immediately follows $a_{ik}$) is then activated and performs operations
with the $b_{kj}$'s in the next block. (These $b_{kj}$'s in BS are also copied to
BF at the beginning of that block to supply the operands that are
needed in that block.) To implement this type of alignment we
use a multiplexer $M_A$ to make the data in the AS channel
gain one time unit at the last PE of each block. This multiplexer
is controlled by a flag which is set at the last PE of each block.
Flag $\psi$ is set by control signals I and J, whose operations are
described in the appendix. Notice that the signal in the ACT channel
does not gain one unit of time at the end of each block. Thus,
in block $i+1$, $a_{i+1,k}$ is active if $a_{ik}$ was active in block $i$, for some k.

Transportation of Data from Row to Row 


To simulate the data flow from one row to another row in the 
2-D array on the 1-D array, an extra slow channel BS is used to 
transport the data. The data from a block to its adjacent block 
is saved in this slow channel. This data will be used by the PEs 
in the next block (by copying the data in slow channel BS into 
the fast channel BF at the beginning of that block) as operands 
of that block. Switching of data from BS to BF is implemented
by multiplexer $M_B$, which is also controlled by the flag $\psi$.


The overall system structure is shown in figure 4. The
structure of a PE is shown in figure 5. The operation of the PEs is as
follows:


Read data into registers from input ports.
If (ACTIVE=1) then C <= C + AS.LR * BF.R
If ($\psi$=1) then
begin
  $M_A$ selects data from AS.LR.
  $M_B$ selects data from BS.RR.
end
else
begin
  $M_A$ selects data from AS.RR.
  $M_B$ selects data from BF.R.
end


The algorithm uses a simple data input sequence in which the
data is input in every clock continuously without any delay. At
$t_0$, $a_{11}$ is fed into the AS channel, and $b_{11}$ is fed into the BS and BF
channels of $PE_1$ (the leftmost PE) in the array. Matrix A is fed
in column major order, i.e., $a_{11}, a_{21}, a_{31}, \ldots, a_{n1}, a_{12}, \ldots, a_{nn}$.
Matrix B is fed in row major order, i.e., $b_{11}, b_{12}, b_{13}, \ldots, b_{1n},
b_{21}, \ldots, b_{nn}$. Also, the control input ACT is set to 1 every time
$a_{1k}$, $1 \le k \le n$, is inserted into the array.


Timing Analysis 


By assuming the delay of each channel within each PE as 
a parameter, the transportation of data in each channel can be 
described by timing equations and the alignment and activation 
requirements can be described by constraint equations. Using 
these timing and constraint equations, optimal parameters can 
be chosen to minimize the delay. In the following design we 
assume that the computations begin at t=1. 


Timing equations

Let

$I(a, u, v)$ = time at which $a_{uv}$ is input to $PE_1$, $1 \le u, v \le n$

$$I(a, u, v) = (v-1)*n + u \qquad (1)$$

$I(b, r, s)$ = time at which $b_{rs}$ is input to $PE_1$, $1 \le r, s \le n$

$$I(b, r, s) = (r-1)*n + s \qquad (2)$$

$I(ACT, k)$ = time at which the $k$-th ACT = 1 is input to $PE_1$, $1 \le k \le n$

$$I(ACT, k) = (k-1)*n + 1 \qquad (3)$$

The following are the timings of the data $a_{uv}$, $b_{rs}$ and control
signal ACT that appear at processor p, $1 \le p \le n^2$. $PE_p$
computes $C_{ij}$ where p is given by

$$p = (i-1)*n + j \qquad (4)$$


1. For $a_{uv}$, $1 \le u, v \le n$,

$t(a, u, v, p)$ = time at which $a_{uv}$ appears at $PE_p$

$$t(a, u, v, p) = I(a, u, v) + \alpha(p-1) - \beta\lfloor (p-1)/n \rfloor \qquad (5)$$

In the above equation, the first term is the time at which $a_{uv}$
was input to $PE_1$. The second term is the delay experienced in
the AS channel of (p-1) PEs. $\alpha$ is the delay of the AS channel
within a PE. The last term corresponds to the time gained at
the end PE of those $\lfloor (p-1)/n \rfloor$ blocks in front of $PE_p$. $\beta$ is the
time gained by $a_{uv}$ at the end of each block.


2. For $b_{rs}$, $1 \le r, s \le n$,

$t(b, r, s, p)$ = time at which $b_{rs}$ appears at $PE_p$

$$t(b, r, s, p) = I(b, r, s) + \gamma * \lfloor (p-1)/n \rfloor * n + \delta((p-1) \bmod n) \qquad (6)$$

In the above equation, the first term corresponds to the time at
which $b_{rs}$ is input to $PE_1$. The second term corresponds to the
delay experienced by the data as it travels in the BS channel
in all the blocks before the block to which p belongs. $\gamma$ is the
delay in the BS channel. The last term is the delay experienced
within the block to which $PE_p$ belongs. $\delta$ is the delay in the
fast channel BF.
3. For the ACT signal,

$t(ACT, k, p)$ = time at which the $k$-th ACT = 1 (denoted as $ACT_k$) appears in
$PE_p$

$$t(ACT, k, p) = I(ACT, k) + \omega(p-1) \qquad (7)$$

In the above equation, the first term corresponds to the time at
which the ACT signal is input to $PE_1$. The second term corresponds
to the delay experienced by the signal as it travels in the control
channel. $\omega$ is the delay of the control channel within each PE.


Constraint equations 


In order to correctly perform matrix multiplication, we need
to implement the following operation: activate $PE_p$ to perform
operations of the type $C_{ij} \leftarrow C_{ij} + a_{ik} * b_{kj}$ during the
activation period, where $p = (i-1)*n + j$. That is, when $ACT_k$
arrives at $PE_p$, the specific $a_{ik}$ and $b_{kj}$ should also be in that
PE. Thus, the data $a_{uv}$, $b_{rs}$ and $ACT_k$ arriving at a PE must
satisfy the following conditions:

1. $u = i$
2. $s = j$
3. $v = r = k$
4. $t(a, u, v, p) = t(ACT, k, p) = t(b, r, s, p)$

Using the timing and constraint equations we obtain the
following equations:

1. $\omega - \delta = 1$
2. $\beta = 1$
3. $\omega = \alpha = \gamma$

A set of values satisfying the above is:

$$\alpha = 2,\ \beta = 1,\ \gamma = 2,\ \delta = 1,\ \omega = 2.$$

The above parameters mean that in each PE there are 2 time
units of delay in the AS channel, and $a_{uv}$ gains 1 time unit at the
end of each block. The rest of the channel delays are: [BS, 2], [BF,
1], [ACT, 2], where [x, y] denotes that there are y units of delay in
the x channel.


The above analysis leads to: 


Theorem 1 The 1-D array correctly performs all the
computations of the 2-D array for matrix multiplication at the end of
time $t = 3n^2 - n - 1$, assuming the computation begins at time
t = 1.
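The timing equations (1) to (7) and the completion time can be checked numerically. The Python sketch below hard-codes one admissible parameter set (AS delay 2, gain 1 per block, BS delay 2, BF delay 1, ACT delay 2) and verifies, for every i, j, k, that the three arrival times coincide at the PE computing $C_{ij}$. The function names are our own and the check is brute force rather than a proof.

```python
ALPHA, BETA, GAMMA, DELTA, OMEGA = 2, 1, 2, 1, 2


def t_a(u, v, p, n):
    """Equation (5): arrival time of a[u][v] at PE p."""
    return (v - 1) * n + u + ALPHA * (p - 1) - BETA * ((p - 1) // n)


def t_b(r, s, p, n):
    """Equation (6): arrival time of b[r][s] at PE p."""
    return ((r - 1) * n + s
            + GAMMA * ((p - 1) // n) * n + DELTA * ((p - 1) % n))


def t_act(k, p, n):
    """Equation (7): arrival time of the k-th ACT = 1 at PE p."""
    return (k - 1) * n + 1 + OMEGA * (p - 1)


def last_aligned_time(n):
    """Latest activation time, or None if some operands fail to meet."""
    last = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            p = (i - 1) * n + j              # PE that computes C[i][j]
            for k in range(1, n + 1):
                t = t_act(k, p, n)
                if not (t_a(i, k, p, n) == t == t_b(k, j, p, n)):
                    return None
                last = max(last, t)
    return last


finish_4 = last_aligned_time(4)
```

For n = 4 the last multiply-accumulate occurs at t = 43, which equals $3n^2 - n - 1$.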


3.2 Linear Arrays for Transitive Closure 


As another illustration of the mapping technique, we design lin- 
ear arrays for the transitive closure problem by mapping the 
computations of a well known 2-D array. 


A 2-D systolic array for the transitive closure problem has 
been derived in [5]. Its structure is shown in figure 6 which is 
the same as figure 1 except for the end around connections. 


The input is two copies of the n × n adjacency matrix, with 1's on
the diagonal, read into an n × n array of processors. The output
is found in the processor array and is read out of the right and 
pet edges. Three passes are needed for the computations 
[26].


Unlike the systolic array for matrix multiplication, the data
in the array is updated at certain times by the PEs in each row
(column) before moving to the next row (column). Two control
signals DIAGA and DIAGB can be used to implement this
update. DIAGA (DIAGB) is associated with $a_{ij}$ ($b_{ij}$) such that

if i = j then DIAGA (DIAGB)=1 else DIAGA (DIAGB)=0.

When $PE_{ij}$ receives DIAGA, it updates $b_{ij}$ (i.e., $b_{ij} \leftarrow C_{ij}$)
if DIAGA=1. Similarly, when DIAGB is equal to 1, $a_{ij}$ is
updated (i.e., $a_{ij} \leftarrow C_{ij}$). We call this the update mechanism.


To map the 2-D array in figure 6 into a 1-D array, we use the
same partitioning and stretching method as in matrix multiplication.
Therefore, the solutions to the alignment, transportation and
PE activation problems are the same in this design. Hence,
we will only address the two major differences compared to matrix
multiplication: (1) simulation of the end around connections in
the 2-D array and (2) implementation of the update mechanism.


The activation design of PEs in matrix multiplication makes 
the O(n) end around connections in the 2-D array easy to imple- 
ment in the 1-D array. For example, in the 2-D array a1 goes 
from PE, through PE, and back to PE,. This operation can 
be simulated in 1-D array by letting a1; go through PE, to PE, 
activated and from PEs to PE, deactivated and then back to 
PE, again. The same design can be used for the elements in 
B matrix. In this way, 2n end around connections in the 2-D 
array can be simulated by 2 connections in the 1-D array. 


We now consider implementing the update mechanism in
the 1-D array. The control signals for the update mechanism can
be associated with the data as in the 2-D case. The data to be
moved from row to row is the B matrix. However, in the 1-D
case the $b_{ij}$ that operates with $a_{ij}$ is in the BF channel,
while the $b_{ij}$ to be updated and transported to the next block is in
the BS channel. This does not lead to any timing problems,
since the $b_{ij}$ in BF is moving faster than the corresponding $b_{ij}$
in BS. Thus, the updated data inside a PE must be placed onto
the slow channel corresponding to the B matrix.


The linear array is shown in figure 7. There are $n^2$ PEs,
numbered 1 to $n^2$. The PEs are connected by the following
data paths (figure 8).


1. Two slow channels corresponding to A and DIAGA inputs.


2. Two slow channels corresponding to B and DIAGB in- 
puts. 


3. Two fast channels corresponding to B and DIAGB in- 
puts. 


4. A one-bit control signal, which passes through the array, is
used to indicate whether the $b_{ij}$ in the BS channel has
been updated or not. The control signal, UP_B, together
with the flag NEW_B, is used to update the $b_{ij}$ in the BS
channel. It is initialized to 0 when it is fed at the leftmost
PE.


The detailed operation of the PE during each clock period and 
the operation of the array can be found in [11]. 


Using the timing analysis as in matrix multiplication, it is easy 
to show: 


Theorem 2 The 1-D systolic array of figure 10 computes the
transitive closure of an n x n adjacency matrix in time
$7n^2 - 3n + 1$.


3.3. Family of Arrays for Matrix Computations 


A general methodology to design a family of arrays for matrix
computations is as follows.


1. Partition the 2-D array into a collection of disjoint rows
CROW_1, CROW_2, ..., CROW_r, $r = \lceil n/m \rceil$. The number
of rows in a collection is equal to the memory size m available
in each PE, except CROW_r, which may have fewer than m
rows.


2. The linear array consists of $\lceil n/m \rceil$ blocks, each block
having n PEs. The computations performed by the PEs in
block_i are the same as the computations of the PEs in CROW_i,
$1 \le i \le r$.


3. Feed the A and B bands of the input matrices at the
leftmost PE.


4. Selectively activate the PEs to perform a step of the matrix
multiplication algorithm.


5. Within each block, save the elements of the B matrix in a slow
channel, which will be used by the PEs in the next block.
At the end of each block, switch the B matrix data from the
slow to the fast channel so that they can compute with the
elements of the A matrix within the next block.


We will use the above technique to map the two dimensional 
array for matrix multiplication onto two linear array models. In 
each model we will show different partitioning schemes to result 
in an optimal family of arrays for matrix multiplication. 


Variable Memory Family (VMF) Model 


In this model, the number of I/O channels is fixed. Thus, 
when designing a special purpose chip, the number of pins per 
chip is fixed for all members in this family. 


Suppose we can build chips with O(s) storage and an ALU.
In this scheme, the 2-D array having n rows is partitioned into a
collection of disjoint rows as follows: CROW_i has s consecutive
rows starting at the $((i-1)s+1)$-th row of the 2-D array. Thus, there
are $\lceil n/s \rceil$ CROWs. The resulting linear array will have
$n\lceil n/s \rceil$ PEs grouped into $\lceil n/s \rceil$ blocks of n PEs.
The PEs in the i-th block perform the computations of the PEs of the
i-th CROW. The computations of PE_{ij}, $1 \le i,j \le n$, in the 2-D
array are performed by PE_{(m-1)n+j} in the 1-D array, where
$m = \lceil i/s \rceil$.


As an illustration, consider 4 x 4 matrix multiplication. A
partitioning of the 2-D array for 4 x 4 matrix multiplication and its
mapping to a linear array are shown in figure 9 and figure 10 for
s=2. In this example, CROW_1 has rows 1 and 2 and CROW_2
has rows 3 and 4 of the 2-D array for matrix multiplication.
Since $\lceil n/s \rceil = 2$, there are two blocks of PEs, each block having
n=4 PEs. Thus, the resulting linear array has 8 PEs as shown in
figure 10. The PEs are connected by three channels which carry
the input data: a fast channel (BF) and two slow channels, AS
and BS, which are used by the elements of A and B respectively.
In addition, one bit wide control lines ACT, OC, I, J connect
adjacent PEs. The detailed design can be found in [10].
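The index mapping of the VMF scheme can be checked with a few lines of code. The following is a minimal sketch, assuming 1-based indices as in the text; the function names `vmf_map` and `vmf_num_pes` are hypothetical and introduced only for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def vmf_map(i, j, n, s):
    """Map PE(i, j) of the n x n 2-D array to (block m, linear PE index).

    Follows the rule m = ceil(i / s), index = (m - 1) * n + j,
    with 1-based i, j as in the text.
    """
    if not (1 <= i <= n and 1 <= j <= n):
        raise ValueError("i and j must lie in 1..n")
    m = ceil_div(i, s)
    return m, (m - 1) * n + j


def vmf_num_pes(n, s):
    """Number of PEs in the linear array: n * ceil(n / s)."""
    return n * ceil_div(n, s)


if __name__ == "__main__":
    n, s = 4, 2
    print("PEs in linear array:", vmf_num_pes(n, s))
    for i in range(1, n + 1):
        print([vmf_map(i, j, n, s) for j in range(1, n + 1)])
```

For n = 4 and s = 2, rows 1 and 2 map to PEs 1 to 4 (block 1) and rows 3 and 4 map to PEs 5 to 8 (block 2), which agrees with the two-block, eight-PE example above.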


The performance of the design can be summarized as follows:


Theorem 3 The above method performs the multiplication of
two n x n matrices using $n\lceil n/s \rceil$ PEs having O(s) memory per
PE in time $t = n^2 + 2n\lceil n/s \rceil - \lceil n/s \rceil + 1$.


Variable Channel Family (VCF) Model 


In this model we assume we can build k I/O channels,
$1 \le k \le n$, per PE. In this scheme, the 2-D array having n rows is
partitioned into a collection of disjoint rows as follows: CROW_i
has the $(i + (r-1)\lceil n/s \rceil)$-th rows of the 2-D array, where
$1 \le r \le s$. Thus, there are $\lceil n/s \rceil$ CROWs. The resulting
linear array will have $n\lceil n/s \rceil$ PEs grouped into
$\lceil n/s \rceil$ blocks of n PEs. The PEs in the i-th block perform
the computations of the PEs of the i-th CROW. The computations of
PE_{ij}, $1 \le i,j \le n$, in the 2-D array are performed by
PE_{(m-1)n+j} in the 1-D array, where m = (i mod s). The partition
graph and its mapping to a linear array are shown in figure 11 and
figure 12 for a 4 x 4 matrix multiplication with k=2.


The systolic array in general consists of $n\lceil n/k \rceil$ PEs, where k
is the number of channels. There are k slow channels AS[l],
$1 \le l \le k$, which are used by the elements of the A matrix. The elements
of the B matrix are fed in row major order, which corresponds to the B
bands in figure 1. However, the elements of the A matrix are input in
the following way. Channel AS[l] carries $\lceil n/k \rceil$ elements of each
column of A, starting at the $(\lceil n/k \rceil (l-1)+1)$-th element. Append
$n - \lceil n/k \rceil$ dummy data items, denoted "Δ", at the end of each
column's data. Thus, for n = 4, k = 2 the input sequence to AS[1] is
$a_{11}, a_{21}, Δ, Δ, a_{12}, a_{22}, Δ, Δ, a_{13}, \ldots$. Similarly,
$a_{31}, a_{41}, Δ, Δ, a_{32}, a_{42}, Δ, Δ, a_{33}, \ldots$ is the input
sequence to AS[2].
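The feeding rule for the slow A channels can be illustrated by generating the input sequence directly. The sketch below is an assumption-laden reading of the rule above: the function name `as_channel_input`, the `"*"` placeholder for dummy data, and the padding of rows beyond n (when k does not divide n) are all choices made here for illustration, not details from the paper.

```python
def as_channel_input(n, k, l, dummy="*"):
    """Input sequence for slow channel AS[l] (1-based l), column by column.

    Each column contributes ceil(n/k) elements of A starting at row
    ceil(n/k) * (l - 1) + 1, followed by n - ceil(n/k) dummy items.
    Rows beyond n are padded with the dummy (an assumption for k not
    dividing n). Labels like "a21" are only unambiguous for n < 10.
    """
    rows_per_channel = -(-n // k)          # ceil(n / k)
    first_row = rows_per_channel * (l - 1) + 1
    seq = []
    for col in range(1, n + 1):
        for row in range(first_row, first_row + rows_per_channel):
            seq.append(f"a{row}{col}" if row <= n else dummy)
        seq.extend([dummy] * (n - rows_per_channel))
    return seq


if __name__ == "__main__":
    for channel in (1, 2):
        print(f"AS[{channel}]:", ", ".join(as_channel_input(4, 2, channel)[:9]))
```

With n = 4 and k = 2 the two sequences begin with a11, a21 and a31, a41 respectively, each pair followed by two dummy items, matching the example in the text.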

By performing a timing analysis [10], we can show: 


Theorem 4 The above method performs the multiplication of
n x n matrices using $n\lceil n/k \rceil$ PEs, each PE having O(k) storage
and O(k) I/O channels, in time $t = n^2 + 2n\lceil n/k \rceil - \lceil n/k \rceil + 1$.
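Theorems 3 and 4 share the same PE count and time expression, with s (memory per PE) or k (channels per PE) as the family parameter. A small sketch, assuming the expressions n * ceil(n/p) and n^2 + 2n * ceil(n/p) - ceil(n/p) + 1 as read from the theorems; the name `family_cost` is hypothetical:

```python
def family_cost(n, p):
    """Return (number of PEs, time) for an n x n product.

    p is the family parameter: s in the VMF model, k in the VCF model.
    Both theorems give PEs = n * ceil(n/p) and
    time = n^2 + 2n * ceil(n/p) - ceil(n/p) + 1.
    """
    if not (1 <= p <= n):
        raise ValueError("parameter must lie in 1..n")
    blocks = -(-n // p)                    # ceil(n / p)
    pes = n * blocks
    time = n * n + 2 * n * blocks - blocks + 1
    return pes, time


if __name__ == "__main__":
    n = 4
    for p in (1, 2, 4):
        pes, time = family_cost(n, p)
        print(f"p={p}: {pes} PEs, time {time}")
```

The loop shows the trade-off within a family: a larger s or k reduces both the number of PEs and the completion time, at the price of more memory or more I/O channels per PE.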


The time complexity of matrix multiplication on the VMF
and VCF models is the same. The VMF model has a fixed number
of I/O channels. The time available for the execution of a
scalar multiplication is one clock cycle. Thus, high speed
multipliers are needed in this design. The VCF model uses more
I/O channels, but if k channels are used, then k scalar
multiplications need to be performed over n cycles. Thus, if k is small
compared to n, then multipliers with low hardware complexity
are sufficient to implement this design.


3.4 Designing Fault Tolerant Systolic Arrays 
for Wafer Scale Integration 


The advantages of a special class of linear systolic arrays
suitable for WSI technology have been reported in [8]. The
most important property of this type of linear array is that all
its data flows are in one direction. By modeling a systolic array
as a directed graph, the following result has been shown in [8]:


Proposition 2 For any design, if all the edges in a cut set are 
unidirectional, adding the same delay (bypass) registers (which 
simulate faulty PEs) to all the edges in the cut will result in an 
equivalent design. 


As a result, the faulty PEs can be replaced by bypass regis- 
ters. The above discussion can be captured in the following 
fault model which will be used in this paper [29]: 


1. The PEs are arranged in a straight line with a system
of buses running parallel to them. Each bus has a
constant number of buffer registers (per PE) embedded
in it. The buffer registers correspond to the delay when
a signal passes through a PE. Also, a switch mechanism
is used to select the data route of each bus. The route
depends on the fault pattern.


2. Propagation delay is assumed to be proportional to the
wire length. We incorporate this into our design by introducing
a constant unit of delay whenever a signal bypasses
a PE.


3. As in other models, the buses and switches are assumed to
be reliable, while the PEs may be faulty. Fault tolerance 
is achieved by hooking working PEs into a desired logical 
structure, in our case, a linearly connected array. 
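The reconfiguration described by this fault model can be sketched in a few lines. The following is a minimal illustration, assuming one unit of delay per bypassed PE (the model only says "a constant unit"); the function name and the boolean fault-pattern encoding are hypothetical.

```python
def logical_linear_array(faulty):
    """Hook the working PEs into a logical linear array.

    faulty: sequence of booleans, one per physical PE, left to right
            (True means the PE is faulty and is bypassed).
    Returns a list of (physical_index, bypass_delay) for the working PEs,
    where bypass_delay is the total extra delay accumulated from the
    faulty PEs bypassed so far, assuming one unit per bypassed PE.
    """
    chain = []
    bypassed = 0
    for index, is_faulty in enumerate(faulty):
        if is_faulty:
            bypassed += 1          # signal passes through a bypass register
        else:
            chain.append((index, bypassed))
    return chain


if __name__ == "__main__":
    pattern = [False, True, False, True, True, False]
    for physical, delay in logical_linear_array(pattern):
        print(f"physical PE {physical}: extra delay {delay}")
```

Because all data flows are unidirectional, the same extra delay applies to every bus at a bypassed position, which is the condition under which the bypass registers leave the design equivalent.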


The complexity of matrix multiplication on this model has
been studied in [29], which establishes the following lower bound:


Proposition 3 Any systolic algorithm computing the product
of n x n matrices using $n^3$ scalar multiplications on the above
model must take $\Omega(n\sqrt{n})$ time.


In [29], a matrix multiplication algorithm with $O(n\sqrt{n})$ delay is
designed on the above model. Our technique in section
2 also leads to a simple optimal fault tolerant array for matrix
multiplication with improved performance.


The partition rules are similar to those in the VMF
model. However, the scheme to solve the alignment and
transportation problems is similar to that of the VCF model. These are
summarized as follows:


1. Partition the 2-D array for matrix multiplication into a
collection of disjoint rows, CROW_1, CROW_2, ..., CROW_{\sqrt{n}}.
The number of rows in a collection is equal to $\sqrt{n}$.


2. The linear array consists of $\sqrt{n}$ blocks, each block having
n PEs. The computations performed by the PEs in block_i
are the same as the computations of the PEs in CROW_i,
$1 \le i \le \sqrt{n}$.


3. Divide each column of the A matrix (each row of the B matrix)
into $\sqrt{n}$ parts and feed them into the AB_i (BB_i) buses,
$1 \le i \le \sqrt{n}$, in column major order (row major order).


This leads to [9]: 


Theorem 5 The above systolic array computes the elements of
C = A x B in time $3n\sqrt{n} - 2 + 2r$, where r is the number of
faulty PEs. Further, all the data flows are unidirectional and the
distance covered by every signal is one unit in each clock period.


4. Conclusion 


In this paper, we presented a new technique to design lin- 
ear systolic arrays with limited I/O bandwidth. All our designs 
have simple control, lexicographic I/O and require a minimum 
number of processors. These designs can be shown to be opti- 
mal with respect to area and time [6]. Table 1 compares the 
performance of several designs in the literature with the pro- 
posed design for matrix multiplication on linear arrays. Table 
2 compares the designs for transitive closure on linear arrays. 
Table 3 compares the designs for family of linear arrays. In ad- 
dition, our designs result in unidirectional data flow. Therefore, 
they can be easily implemented in WSI with fault tolerance ca- 
pability. Table 4 compares our matrix multiplication design on 
the fault model with known results in the literature. 


|                         | This Paper      | Method in [20]       | Method in [23] |
| 1. Number of processors | [illegible]     | [illegible]          | [illegible]    |
| 2. Delay time           | 3n^2 - n - 1    | [illegible]          | [illegible]    |
| 3. Data input           | simple          | need to insert zeros | complex        |


Table 1: Comparison of Matrix Multiplication on Linear Array


|                         | This Paper      | [column heading illegible] |
| 1. Number of processors | [illegible]     | [illegible]                |
| 2. Delay time           | 7n^2 - 3n + 1   | [illegible]                |
| 3. Area of PE           | O(1)            | [illegible]                |


Table 2: Comparison of Transitive Closure on Linear Array


|                         | This Paper                              | Method in [24] |
| 1. Number of processors | n⌈n/s⌉                                  | [illegible]    |
| 2. Memory space         | [illegible]                             | [illegible]    |
| 3. Delay time           | n^2 + 2n⌈n/s⌉ - ⌈n/s⌉ + 1               | [illegible]    |
| 4. Data input sequence  | continuous                              | complex        |


Table 3: Comparison of family of Linear Array for Matrix Multiplication


|                            | This Paper   | Method in [29] |
| 1. Number of processors    | n√n          | [illegible]    |
| 2. Delay time              | 3n√n - 2     | [illegible]    |
| 3. Total number of buses   | [illegible]  | [illegible]    |


Table 4: Comparison of optimal matrix multiplication
on the fault model with known result.
Appendix


Setting of $\mu$


I and J are used to set $\mu$ inside each PE. I is set to 1 every
n clock periods, and J is set to 1 at the start of the operation
of the array. Thus, for $t_0 \le t \le t_0 + 3n^2 - n - 1$,

$$I = \begin{cases} 1 & \text{if } t = t_0 + n(i-1),\ 1 \le i \le (n-1), \\ 0 & \text{otherwise,} \end{cases}
\qquad
J = \begin{cases} 1 & \text{if } t = t_0, \\ 0 & \text{otherwise.} \end{cases}$$


The signals I, J are fed at the leftmost PE and are propagated
with a delay of one and two units respectively in each PE.


Let $I_k$ and $J_k$ denote the I, J that enter PE_k respectively
($1 \le k \le n^2$). Then,

$I_k = 1$ at $t = t_0 + (k-1) + n \cdot i$, $1 \le k \le n^2$,
$J_k = 1$ at $t = t_0 + 2(k-1) + 1$, $1 \le k \le n^2$.


It is easy to verify that $I_k = 1$ and $J_k = 1$ only when $k = n \cdot i$.
This occurs at time $t = t_0 + 2(n \cdot i) - 1$.
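The coincidence of the two control signals can be checked numerically. The sketch below assumes the arrival times read from this appendix, namely that the I pulse with index i reaches PE_k at t0 + (k-1) + n*i and the J pulse reaches PE_k at t0 + 2(k-1) + 1; the range of i (1 to n) and the function name are assumptions made for illustration.

```python
def coincidences(n, t0=0, i_max=None):
    """List (k, t) where an I pulse and the J pulse enter PE_k together.

    Assumed arrival times (as read from the appendix):
      I pulse i reaches PE_k at  t0 + (k - 1) + n * i
      J pulse   reaches PE_k at  t0 + 2 * (k - 1) + 1
    i ranges over 1..i_max (default n, an assumption).
    """
    if i_max is None:
        i_max = n
    hits = []
    for k in range(1, n * n + 1):
        t_j = t0 + 2 * (k - 1) + 1
        i_times = {t0 + (k - 1) + n * i for i in range(1, i_max + 1)}
        if t_j in i_times:
            hits.append((k, t_j))
    return hits


if __name__ == "__main__":
    n = 3
    for k, t in coincidences(n):
        print(f"PE {k}: I and J coincide at t = t0 + {t}")
```

Under these assumptions the pulses meet only at PEs whose index is a multiple of n, that is at the last PE of each block, and the meeting time equals t0 + 2k - 1.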
Figure 1: Partitioning a 2-D array
Figure 2: Stretching to form a linear array
Figure 3: Linear array fed with continuous data
Figure 4: System structure of linear array for matrix multiplication
Figure 5: The internal structure of PE
Figure 6: Transitive Closure on 2-D Array
Figure 8: The PE structure for transitive closure
Figure 10: Mapping onto VMF model for n = 4 and s = 2
Abstract


This paper describes the architecture and implementation 
of the CESAR computer system. The computing unit in 
CESAR has from one to four programmable systolic arrays 
working strictly in parallel, representing a SIMD (Single 
Instruction Multiple Data) structure. Each array consists of 
128 custom designed processing elements capable of 
performing bit-serial operations on 32-bit data. Including 
control logic and memory units, a complete CESAR system 
with four systolic arrays is implemented on 13 circuit boards. 
Originally developed for processing of images from Synthetic 
Aperture Radar, CESAR is also suitable for other
applications demanding extensive vector processing. 


Introduction


Parallelism and pipelining are two classical concepts which 
have proven to be the keys to exploitation of the huge 
resources offered by today's VLSI technology.1 In the
CESAR computer system,2-4 parallelism and pipelining are 
combined on different levels to achieve the necessary 
throughput for computationally intensive problems. Focusing 
on processing of images from Synthetic Aperture Radar 
(SAR), the CESAR computer is a result of comprehensive 
research and development activities at the Norwegian Defence 
Research Establishment over the past decade. 


Figure 1 The CESAR Computer System (block labels: control unit, buffer memory, memory, host computer)
CESAR Architecture


The fundamental structure of the computing unit in
CESAR resembles that of a systolic array architecture.5,6 As
shown in Fig.2 a), an 8x16 array of bit-serial processing
elements operates on strings of data that flow regularly
through the network and interact where they meet. Each
serial element (S-element) is a custom designed 2µ CMOS
chip, capable of performing 32-bit floating point or integer
arithmetic and logic operations. In parallel with performing
mathematical operations, an S-element allows data to be
routed through. By adding programmable time delays for
synchronization, computed results and bypassed data can be
merged in neighbouring S-elements for new computations as
shown in Fig.2 b).


Figure 2 The Systolic Array of S-elements: a) the array with its input and output data; b) an example computing C=(A+B)*A.

As shown in Fig.3, the two-dimensional array is
configured as a cylinder, where pairs of input data are fed
from the top, and the outputs are tapped at the bottom.
When a pair of 32-bit input data have been fetched from
memory, a serial conversion starts, whereby data is shifted
into the selected column of the cylinder. With an internal
cycle time of 50 ns, the total shift-in time for 32 bits
becomes 32*50ns = 1600ns.
Figure 3 MALU: Microprogrammable Arithmetic Logic Unit


Since the buffer memory is capable of delivering a pair of
32-bit data every 100ns and also receiving a 32-bit result at
the same speed, 16 columns can be run in parallel. Enabling
each column of the cylinder successively, data flows in and
out of the S-elements as continuous bit streams consisting of
32-bit words lying head to tail. The parallel to serial
conversion in MALU is accomplished by having three
distributed shift registers in each column of the array.


Figure 4 Parallel to Serial Conversion in MALU (labels: buffer memory, serial input, row 7, serial output)


The S-element has a four bit parallel input/output register
for each of the three data channels to the buffer memory.
The eight S-elements in a column together form a 32 bit
shift register whose output is shifted in at the top. Similarly,
the results are serially output at the bottom and shifted
upwards in their respective columns.

For many algorithms, the cylinder can be divided into 
strips, each strip performing the same pipeline of 
computations as its neighbouring strips. Representing a 
SIMD structure, this level of parallelism provides a high 
utilization of the computing power available in CESAR. 
Since each S-element produces a result every 32 clock
cycles, i.e. every 1600ns, the theoretical maximum capacity of
one complete MALU is

$$\frac{128}{1.6 \times 10^{-6}}\ \text{flops} = 80\ \text{Mflops}$$


This capacity is obtained when, for a certain algorithm, every 
S-element is doing a (floating point) computation. As 
inherent in the CESAR architecture, a complex algorithm 
utilizing many S—elements yields a higher performance than a 
simple algorithm occupying few elements. 
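The timing figures quoted in this section follow from a few constants, and the arithmetic can be reproduced as below. This is only a sketch: the constant and function names are chosen here for illustration, and the four-MALU peak is derived by simple multiplication rather than quoted from the text.

```python
CYCLE_NS = 50              # internal cycle time
WORD_BITS = 32             # bits shifted in per operand
MEMORY_PERIOD_NS = 100     # buffer memory delivers one operand pair per period
S_ELEMENTS_PER_MALU = 128  # 8 x 16 array


def shift_in_time_ns(word_bits=WORD_BITS, cycle_ns=CYCLE_NS):
    """Time to shift one word serially into a column."""
    return word_bits * cycle_ns


def columns_in_parallel(memory_period_ns=MEMORY_PERIOD_NS):
    """Columns the buffer memory can keep busy during one shift-in."""
    return shift_in_time_ns() // memory_period_ns


def peak_mflops(elements=S_ELEMENTS_PER_MALU):
    """Peak rate when every S-element yields one result per shift-in time.

    elements per nanosecond times 1000 gives millions of results per second.
    """
    return elements * 1000 / shift_in_time_ns()


if __name__ == "__main__":
    print("shift-in time:", shift_in_time_ns(), "ns")
    print("columns in parallel:", columns_in_parallel())
    print("peak per MALU:", peak_mflops(), "Mflops")
    print("peak for four MALUs (derived):", 4 * peak_mflops(), "Mflops")
```

The peak is an upper bound: as noted above, it is reached only when every S-element performs a floating point computation.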

MALU is fully programmable; that is, a combination of 
instruction words in the S-elements constitutes a MALU 
program. The S-elements are fitted with on-chip RAM 
with a capacity of 32 programs (instructions, routing and 
delay) and 32 constants for use in the computations. 
Changing MALU programs between two bursts of data is 
done by merely switching the global program address to the 
arrays. In the applications studied so far, the program 
memory has proven to be large enough to cover the entire 
algorithm without having to perform a program reload. Thus, 
all setup can be done in an initialization phase to avoid a 
degradation of the computational performance. In addition, 
each of the three data paths between the buffer memory
(BUF) and MALU is easily configurable as either an input or an
output, allowing consecutive refinements of the results
without having to move data between BUF banks.

In the current version of CESAR, four identical pairs of 
MALUs and buffer memories are working strictly in parallel. 
During a computation, all the four MALUs execute the same 
program, but on different sets of data. These data (vectors) 
are located at the exact same addresses relative to the start of 
their buffer memories. The system has been designed to 
facilitate the distribution of input data to the four buffer 
memories and the collection of results without any extra 
overhead compared to a single MALU version. Listed in 
Table 1 is a selection of existing MALU algorithms and their 
actual capacity in a four-MALU version: 


Name of Algorithm              Actual capacity (Mflops)
FFT Radix 4 Butterfly          [value illegible]
[row illegible]                [value illegible]
Folding with 4 pt. filter      [value illegible]


Table 1 Actual Capacity for Different Algorithms


Another commonly used way of measuring system 
performance is in terms of time required for a specific 
computation, e.g. a 1024 point complex FFT. On a 
four-MALU CESAR this typical signal processing application 
executes in 0.257 milliseconds (average) compared to 0.4037 
milliseconds! on an 8.5 nanosecond Cray X—MP. 

The execution time is specific for each instruction, which
affects the time from when data enters the MALU array
until the first results are ready at the outputs. This is often
referred to as the tail of the computation pipeline and differs
in length depending on the algorithm. For most signal
processing algorithms the tail varies from 10-40 µsec, which
for a 32k vector contributes 0.2%-0.6% of the total
processing time. It should also be noted that once the array is
filled with data, operands are presented and results are
produced at the same rate independent of the program
executed.

Compared to what we often see in other systolic arrays, 
MALU has several striking characteristics: 


A. MALU.


- Each array element is capable of performing relatively
complex operations.

- The array elements have rich connections to their
neighbours (6 inputs, 6 outputs).

- The elements are individually programmable, and
grouped together they form variable pipelines of
computations.


B. Other known systolic arrays.


- Array cells are usually limited to simple bit-serial
operations.

- Hardwired interconnections are often used between the
elements.

- Each element is only capable of doing one, dedicated
operation.


Hardware Realization 


Figure 5 Hardware Modules in CESAR


In Fig.5, a block diagram shows the different hardware
modules in the prototype version of CESAR, which is
currently in its final stage of debugging and testing. A full
system with four MALUs is implemented on 13 PCBs, each
of size 11' by 16'. Compared to other systems with
approximately the same performance, the hardware is
compact, and, due to the use of CMOS and TTL logic, small
fans are the only cooling necessary. A brief description of
the modules is given below:


- MALU (Microprogrammable Arithmetic Logic Unit)

A complete array of 8x16 S-elements is fitted on one
circuit board. The S-elements are packaged in 68 pin
PLCCs which are surface mounted on both sides of the
board. This rather complex hardware solution did,
however, put some restrictions on the design of the
S-element in terms of power consumption, and the 100K
transistor chip only dissipates 0.25W at 20 MHz.


- BUF (BUFfer Memory)
Each BUF contains three separate two-port 2 Mbyte static
RAM banks for intermediate storage of MALU data.


- MAINMEM (MAIN MEMory)
CESAR has 32Mbytes of main memory for storage of 
intermediate data when BUF space is inadequate. 


- TRAP (TRiple Address Processor)
The three bit-slice address processors on TRAP are
necessary for selecting the correct data to be sent into the
MALU and addressing the storage area for the results.
Each address processor is programmable for different
addressing algorithms, e.g. data stored with fixed
increments or FFT bit reversing.


- CP (Control Processor)

CP is based on the Motorola 68020 microprocessor and is 
responsible for the overall control in CESAR. The 
application programs written in the high level language 
CESAR Pascal® as well as system software are executed in 
CP. A local VMEbus? is used to interchange control 
information between CP and the other hardware modules 
in CESAR. 


- SEQ (SEQuencer)
The Sequencer provides the detailed control signals for the 
CESAR computations. It synchronizes the address 
generation in TRAP with the internal computations in 
MALU to ensure correct dataflow between BUF and 
MALU. 


- DAP (DAta Port)
The Dataport controls all DMA transfers between separate
memory modules, i.e. the buffer memories, main memory
and the multiport memory residing in the host computer.
DAP also enables CP to access any location in the different
memories.


As can be seen in Fig.5, the system is flexible with respect 
to memory access. Controlled and addressed by the Dataport 
(DAP), the physical data transfers take place on the Local 
Data Bus called LBUS. LBUS is a 40 Mbyte/s data channel 
capable of serving all four BUFs with altogether 12 
connections to the MALUs. 

The complexity of the MALU circuit board makes it hard
to debug in production and in the field. To help solve this
problem, the S-element has a built-in self-test option that
enables the system or user to run a parallel diagnostic in all
512 S-elements in CESAR. The self-test, which is based on
signature analysis, tests the entire chip with the exception of the
on-chip static RAM. The RAM is verified by the control
processor before the self-test is initiated. The S-element can
also be set in a special mode to enhance the testability during
production testing, reducing the number of test patterns
significantly.


Programming the System 


In parallel with designing the hardware, a substantial effort 
has been put into the development of software tools for 
programming and debugging of the CESAR system. At the 
application level, a high order language called CESAR Pascal 
has been developed.2 In addition to standard Pascal, it 
includes special features for describing and synchronizing 
concurrent processes as well as data transfers between the 
memory modules inside and outside CESAR. 

Typically, a library of the most commonly used
vector/signal processing algorithms will be supported. If,
however, the user wants to write his own MALU or TRAP
programs, several tools are available. A graphic editor for
MALU programs allows the user to interactively choose
instructions and create data paths between the S-elements in
the array. An assembler automatically adds routing delays for
synchronization, and a simulator verifies the correctness of
the algorithm. Similarly, the address processors are
programmable in a C-like language with constructs for
generating complex address sequences. A TRAP simulator has
been developed to check the address programs before downloading
to the hardware.


Conclusion 


A major goal in the research and development of the
CESAR computer system has been to create a powerful
number cruncher for processing of SAR images, while
retaining a low cost/performance ratio. Preliminary studies
have also shown that the CESAR architecture provides the
necessary flexibility to solve other computationally intensive
vector problems, such as the ones in seismic and
meteorological processing.4 Also in a variety of other
applications, the ever increasing demand for extensive
computing capacity clearly manifests the need for
unconventional, high performance designs like CESAR.
Commonly Used Terms


BUF Buffer memory 
CESAR Computer for Experimental Synthetic 
Aperture Radar 
FFT Fast Fourier Transform 
LBUS Local Data Bus 
MALU Microprogrammable Arithmetic Logic Unit 
Mflops Million Floating Point Operations per Second 
PCB Printed Circuit Board 
PLCC Plastic Leaded Chip Carrier
SAR Synthetic Aperture Radar 
TRAP Triple Address Processor 
VLSI Very Large Scale Integration 
Abstract -- Asynchronous digital circuits exhibit a high
degree of concurrency. Self-timed implementation is the
most appropriate design discipline for them. We examine
the signal graphs that are subject to formal treatment and
mechanical translation to delay-insensitive circuits. An
example of designing a piece of logic for a typical interface
adapter illustrates the approach and sheds light on future work.


1. Introduction 


Modern technologies make it possible to build
VLSI circuits whose internal behavior
exhibits a high degree of parallelism.
To operate correctly in the presence
of such undesired phenomena as electronic
metastability, signal skews due to higher
wire-to-gate delay ratios, parametric
instabilities of gates etc., these circuits
are designed in a self-timed, or
delay-insensitive, fashion [1,2].
The most widely cited examples of concurrent
hardware are regular structures like
pipeline and wavefront arrays, which are
easily decomposed in a sequential, parallel
or recursive way. On the other hand, such
objects as asynchronous interface adapters,
which are a lot less regular but
can be equally concurrent, are far from
receiving a formal treatment, as
they have been the privilege of engineers
who normally use timing diagrams or flow
charts.

The ultimate goal of our research is
to mechanize the design process to such
a degree that it fits comfortably
in a CAD environment for developing
distributed systems, e.g. for translating
a physical layer protocol specification
into a collection of self-timed modules.
This paper demonstrates the technique of
using a formal model of concurrency for
constructing basic units of interfacing
logic. This technique accommodates a
stepwise design procedure involving such
steps as architectural decomposition,
functional specification of components,
their behavioral signalling expansion,
its validation with respect to
correctness and completeness notions,
and finally Boolean function derivation.


2. Modelling concurrency in logic


A self-timed system is often
regarded as a collection of self-timed
modules that communicate via asynchronous
protocols [1]. It does not require a
global clock. All system level events
are ordered in time by the causal
relations between the modules' actions.
The order established by
the designer must further be preserved
in the final circuit, thereby guaranteeing
correct operation independently of
element and wire delays.

The evolution of logic design 
methods shows that the Huffman state 
machine model is no longer an adequate 
model for asynchronous logic since it 
can not deal with "granulated" concur- 
rency in VLSI. The existing formal 
models for self-timed VLSI systems can 
be split into four groups: 

(i) graphical notations, state or event 
oriented, like Petri nets, transition 
diagrams, parallel flow charts etc.; 

(ii) symbolic notations, like traces or 
path expressions; 

(iii) models based on high level program- 
ming languages, e.g. Ada-like notation; 
(iv) combined models. 

The study of these formalisms shows 
that the usefulness of a model for the 
self-timed circuit design depends on a 
large number of various issues. For 
example, it is affected by the structure 
type (regular vs non-regular, or data- 
flow vs control-flow), the degree or 
granularity of parallelism and data 
dependence, the necessity of abstract 
data typing, the depth of delay-indepen- 
dence (with respect to transistor, gate 
or component level). 

Our formalism, a signal graph based 
on a subclass of Petri nets, is an 
effective substitute for widely used 
timing diagrams because it can be 
analyzed in a mathematically sound 
manner and mechanically translated to 
Boolean functions implementation. 


3. Signal graphs: properties 
and AICUNEELES 


Signal graphs are a very attractive
formal model for analyzing behavioral
specifications of both signalling protocols
and the corresponding interface logic.
They represent a narrower class of
processes than those that can generally
be defined by, say, Petri nets. This is
due to their inability to define
alternatives in processes. However, when
we need to define a highly concurrent
behavior, they provide a succinct
description and, more importantly,
an analysis of polynomial complexity.


We presume some knowledge of Petri
nets and their subclasses, particularly
marked graphs. A marked graph (MG)
generates a distributive marking diagram
(MD) [3]. An MD is an oriented graph whose
vertices are reachable markings and whose
arcs are labeled with firing transitions.
The term "distributivity" is related to the
lattice which can be defined on a set of
vectors of transition firing numbers with
respect to a given initial marking.
In order to define a signal graph, a
set of binary variables (signals)
$Z = \{z_1, z_2, \ldots, z_n\}$ is introduced.
We denote transitions of signal $z_i$ from
0 to 1 by $+z_i$ and from 1 to 0 by $-z_i$.


A signal graph (SG) is defined as an
MG in which vertices are labeled with
signal transitions (changes) of the form
$dz_i$, where $d \in \{+,-\}$.


We call a labeling function conflict-free
if for each reachable marking and
variable $z_i$ there is at most one enabled
vertex labeled with $dz_i$. An SG with a
conflict-free labeling is called coherent.


Coherence is not sufficient for a
specification to be correct because,
although all the changes for each $z_i$ are
linearly ordered, they may be unmatched
with respect to their signs.

We call a labeling function sign-balanced
if, for each sequence of signal
transitions allowed from the initial
marking, between any two transitions of
a variable that have the same sign there
exists at least one transition of that
variable with the other sign. An SG with a
sign-balanced labeling is called
consistent. Consistency provides the
necessary level of correctness of a
specification given by an SG, as
expressed in the following statement.


Statement 1. A consistent SG generates a
state transition diagram. 
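
The sign-balance condition can be illustrated on a single firing sequence. The sketch below is only an illustration under stated assumptions: transitions are written as hypothetical (sign, signal) pairs, and the function checks one given sequence, whereas consistency of an SG requires the property for every sequence allowed from the initial marking.

```python
def is_sign_balanced(sequence):
    """Return True if, for every signal, '+' and '-' transitions alternate.

    `sequence` is a list of (sign, signal) pairs such as ('+', 'AO').
    This checks one firing sequence only, not the whole signal graph.
    """
    last_sign = {}
    for sign, signal in sequence:
        if last_sign.get(signal) == sign:
            # Two transitions of the same sign with no opposite one between them.
            return False
        last_sign[signal] = sign
    return True


balanced = is_sign_balanced([('+', 'r'), ('+', 'a'), ('-', 'r'), ('-', 'a')])
unbalanced = is_sign_balanced([('+', 'r'), ('+', 'a'), ('+', 'r')])
```

A full consistency check would avoid enumerating sequences, which is the point of the precedence and concurrency relations discussed later in this section.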


A state transition diagram (STD) is
an oriented graph whose vertices are
labeled with full states of a specified
circuit, i.e. binary n-tuples of values
of $z_i$, and whose arcs are labeled
with the corresponding changes $dz_i$. The
values that can change between a given
state and another one connected to it
by an arc are marked with a *-token.
A variable whose value in the n-tuple is
marked with * is called excited in the
given state. In this paper we omit the
description of algorithms for converting
a consistent SG to an STD and vice versa.
We only hint that such a conversion may
use the ordinary procedure of building an
MD by depth-first search, where each
marking in the MD relates to a corresponding
state in the STD.

A consistent SG may, however,
generate an STD with multiple states, i.e.
states which are labeled with equal
n-tuples of signal values. Such an STD
is called contradictory. Informally, a
contradiction of this kind means that
the system is under-specified and some
components are still hidden from the
designer's eye. For example, when an SG
defines an interface protocol, these
components may be interpreted as an
internal memory of the controller.
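
The hinted conversion (building the marking diagram by depth-first search and attaching a signal state to each marking) can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the arc-list representation of the marked graph, the transition strings such as '+r', and the four-phase handshake used as input are all assumptions made for the example.

```python
def build_std(arcs, marking, state):
    """Explore reachable markings of a marked graph depth-first.

    arcs    -- list of (source_transition, target_transition); each arc is a place
    marking -- tokens per arc, in the same order as `arcs`
    state   -- dict signal -> initial value (0 or 1)
    Returns (set of signal-value tuples, list of (state, transition, next_state)).
    """
    transitions = sorted({t for arc in arcs for t in arc})
    signals = sorted(state)
    start = (tuple(marking), tuple(state[s] for s in signals))
    seen, stack, edges = {start}, [start], []
    while stack:
        m, s = stack.pop()
        for t in transitions:
            # A transition is enabled when every incoming arc holds a token.
            if any(m[k] == 0 for k, (_, dst) in enumerate(arcs) if dst == t):
                continue
            new_m = list(m)
            for k, (src, dst) in enumerate(arcs):
                if dst == t:
                    new_m[k] -= 1
                if src == t:
                    new_m[k] += 1
            new_s = list(s)
            new_s[signals.index(t[1:])] = 1 if t[0] == '+' else 0
            nxt = (tuple(new_m), tuple(new_s))
            edges.append((s, t, nxt[1]))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {st for _, st in seen}, edges


# Hypothetical four-phase handshake: +r -> +a -> -r -> -a -> +r.
handshake = [('+r', '+a'), ('+a', '-r'), ('-r', '-a'), ('-a', '+r')]
states, edges = build_std(handshake, [0, 0, 0, 1], {'r': 0, 'a': 0})
```

If two different markings carried the same signal tuple, the set of states would be smaller than the set of markings, which is the contradictory case described above.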

We further incorporate a higher
level of correctness into the hierarchy
of SG classes by the notion of a normal
SG, which guarantees the completeness of
a specification. An SG is called normal
if it is consistent and, for each allowed
sequence of markings, it has no proper
subset of variables $Z' \subset Z$ which can
proceed through the full cycle of their
values while the other variables (from
$Z \setminus Z'$) stay unchanged. An STD of a
normal SG is non-contradictory and
distributive [3].

It is suitable to check the
consistency and normalicy using the
relations of precedence and concurrency
built on the set of signal transitions.
The formalization of these relations
requires the introduction of the concept
of a history, or so-called unfolding,
which is an infinite and acyclic object
generated by an SG. Each occurrence of a
transition in an SG yields a unique
vertex in the unfolding. Due to the lack
of space this technique cannot be
fully described here, though we mention
that the unfolding can be truncated to its
first two periods and the above relations
can thus be computed on a finite
object. The algorithm for checking
consistency has a complexity of $O(n^3)$,
where n is the number of vertices in the
original SG.

In order to establish whether a
consistent SG is normal we use a special
formal concept, operational coupledness.
We define a coupled relation on a set
of variables Z. This relation has the
following hierarchy: directly strongly
coupled, strongly coupled, weakly
coupled of rank r, $r \geq 0$, and coupled.
The coupled relation partitions the set
Z into disjoint classes. Omitting
here the formal definitions and proofs, which
can be found elsewhere [4], we only state
the following.


Statement 2. A consistent SG is normal
iff its variables belong to a single
coupledness class.


The complexity of an algorithm for
the normalicy check is O(n‘).

The main advantage of our checking
techniques stems from the fact that
they do not require converting an SG to
an MD or STD, a step having exponential
complexity with respect to the power of


4. An example of self-timed
logic design 


In the above section we have
sketched how we can check the normalicy
of an SG, which is a sufficient condition
for the existence of a distributive STD
and hence of a delay-insensitive circuit
[2]. The circuit can be derived from
the normal SG by obtaining the
Boolean functions (BFs) for the variables $z_i$
of the set Z using a truth table (TT), which
can be built from the STD corresponding
to the SG. However, the chain SG-STD-TT-BFs
involves exponentially complex steps.
Therefore we look for an alternative
technique for the direct (but semantics
preserving) conversion of the SG to the
system of BFs. Such a bridling of the
design complexity is concerned, first of
all, with laying some restrictions
upon the complexity of the coupledness
hierarchy.

In this paper we do not attempt to
show how the general problem of
deriving functions directly from an SG
can be solved. Rather, we illustrate our
design approach with an instructive
example of designing a piece of
interface logic.

FIFO buffers are typically incorpo- 
rated in interfacing adapters as they 
help to keep the performance of the 
whole distributed system at its highest 
communication rate. The original specifi- 
cation of a one-value FIFO cell was 
inspired by [5]. 

Let the FIFO cell consist of two
subcells: the data cell (DC) and the
control cell (CC), as shown in Fig.1.


Figure 1. The structure of the FIFO cell: a data cell with
inputs I/P0, I/P1 and outputs O/P0, O/P1, and a control cell
with acknowledgement signals AO and AI.


The meaning of the signals is as
follows. I/P0 and I/P1 are data inputs,
and O/P0 and O/P1 are data outputs. Both
use the two-rail coding discipline [1],
where for the "zero" and "one" values the
combinations 10 and 01 are respectively
used on the above pairs, and the all-zero
spacer (00) is used for representing the
"data undefined" value. AO and AI are
the acknowledgement signals: AO is
generated by the cell and AI is produced
by the environment. AD is the "All defined"
indication signal, AU is the "All undefined"
indication signal, and H is the "Hold"
command signal. AD and AU are both used
to detect the state of the inputs (if
I/P0 = I/P1 = 0 then AD = 0, AU = 1, and
if I/P0 ≠ I/P1 then AD = 1, AU = 0). H
directs the DC to latch the incoming
value. All AD, AU, and H wires run the
width of the buffer.
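
The two-rail code and the two indication signals can be sketched for a single one-bit cell. The function names below are hypothetical, and the sketch covers only the cases the text defines (the combination 11 is treated as invalid, which is an assumption).

```python
def decode_two_rail(p0, p1):
    """Decode one rail pair (I/P0, I/P1): 10 -> 0, 01 -> 1, 00 -> None (spacer)."""
    if (p0, p1) == (1, 0):
        return 0
    if (p0, p1) == (0, 1):
        return 1
    if (p0, p1) == (0, 0):
        return None
    raise ValueError("11 is not a code word in this two-rail scheme")


def indicators(p0, p1):
    """Return (AD, AU) for one rail pair, following the conditions in the text."""
    ad = int(p0 != p1)                 # data defined: exactly one rail is high
    au = int(p0 == 0 and p1 == 0)      # data undefined: all-zero spacer
    return ad, au


spacer = indicators(0, 0)
zero_value = (decode_two_rail(1, 0), indicators(1, 0))
```

For a buffer wider than one bit, AD and AU would have to combine the per-pair indications across the whole width, which this sketch does not model.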

Fig.2 shows the SG specification of
the CC operation. Analyzing this SG we
can establish that it is consistent:
each variable has all its transitions
ordered within one synchrocycle (a
cycle containing exactly one token).
However, the SG is not normal. The
coupled relation partitions the set
Z = {AI, AO, AD, H, AU} into two disjoint
classes: K1 = {AI, H} and K2 = {AD, AU,
AO}. It can be shown that adding only
one extra variable to the specification,
while preserving the established order
of signal changes for the variables in Z,
will not suffice for making all variables
coupled. After adding two variables d1
and d2 we obtain the resulting SG shown
in Fig.3, which is normal.


Figure 2. An original SG specification
of the control cell operation


Figure 3. A normal SG obtained after
adding extra variables


From this SG we derive BFs in the
following form:


$$z = S_z + \overline{R_z} \cdot z,$$


where $S_z$ is the set function and $R_z$ is
the reset function. Both $S_z$ and $R_z$ are
independent of z. We also demand that the
invariant $S_z \cdot R_z = 0$ holds in order to
avoid conflicts between transitions
which may lead to undesired races in a
circuit.
In order to derive $S_z$ and $R_z$ we
search through the SG for immediate
predecessors of the transition +z, to
include them in the essential $S_z$-term,
and for those of the transition -z, for the
$R_z$-term. If these predecessors correspond
to variables that are strongly
coupled with z, we proceed further to the
orthogonalization step. If some
variable whose transition is a predecessor
of a given transition dz is weakly
coupled (of any rank $r \geq 0$) with z, then
there is a so-called overtaking of the
essential term by some other term, which
must be added to the corresponding set or
reset function. For example, when
deriving the function $S_{AO}$ the essential term
is H·d2, but before AO changes from 0 to
1, H may begin to change from 1 to 0 (in
parallel with AO changing), and hence we
must cure the overtaking by an additional
term which involves a variable that
is strongly coupled with H, i.e. the
term d1·d2. Using d2 in both terms of
$S_{AO}$ also helps us to eliminate the
inclusion of the term for $R_{AO}$, which is
simply $\overline{d2}$ because d2 is immediately
strongly coupled with AO. Thus we obtain
a BF for AO which is non-selfdependent,
i.e. free of feedback:


AO = (H + d1)·d2 + d2·AO = (H + d1)·d2.


One of the important issues in
deriving $S_z$ and $R_z$ is their mutual
orthogonalization, i.e. ensuring that
$S_z \cdot R_z = 0$ is satisfied. This can be done
by strengthening their terms with common
variables. For example, when we obtained
$S_{AO}$ we had d2 as such a variable.
Another example is the function for H,
whose $S_H$ = d1·d2 is strengthened by
d1 because $R_H = \overline{d1}$.
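
The set/reset form and the orthogonality invariant can be checked exhaustively for small functions. The sketch below assumes the form z = S + (not R)·z and assumes that the reset functions are the complements of d2 and d1 respectively; the complement bars are not legible in every formula of this section, so both choices are assumptions made for illustration.

```python
from itertools import product


def next_value(s, r, z):
    """Next value of z in the assumed set/reset form z = S + (not R) * z."""
    return int(s or (not r and z))


def s_ao(h, d1, d2):
    return int((h or d1) and d2)   # set function for AO: (H + d1) * d2


def r_ao(d2):
    return int(not d2)             # assumed reset function for AO


def s_h(d1, d2):
    return int(d1 and d2)          # set function for H: d1 * d2


def r_h(d1):
    return int(not d1)             # assumed reset function for H


bits = (0, 1)
ao_orthogonal = all(s_ao(h, d1, d2) * r_ao(d2) == 0
                    for h, d1, d2 in product(bits, repeat=3))
h_orthogonal = all(s_h(d1, d2) * r_h(d1) == 0
                   for d1, d2 in product(bits, repeat=2))
```

Under these assumptions both products vanish for every input combination, so neither variable can be set and reset at the same time.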


Finally, the above technique yields
the following system of BFs:


d1 = AD·AI·d2 + AI·d1
d2 = AU·H·d1 + AU·d2
H = d1·d2 + d1·H
AO = d1·d2 + d2·H


This system is easily implemented with
four AND-OR-NOT gates and six inverters
(four of them produce d1, d2, H, AO, and
the other two complement AI and AU;
however, the latter can obviously be
eliminated at the transistor level by
using the inhibit inputs of the first
two gates).

The circuit is delay-insensitive 
with respect to delays in gates and 
inverters as well as in those wires 
which are not the feedback connections. 
The feedback delays are presumed negli- 
gible as the corresponding elements are 
accommodated within equichronic regions. 


5. Conclusion

The main characteristic of the
above approach, compared with those
given elsewhere [6,7,8], is that it
provides a technique for effectively
managing concurrency at the logic
level using a formal model which is
quite simple to comprehend for a
wide audience of hardware designers used
to timing diagrams, and at the same time
powerful enough to be formally analyzed
with respect to correctness and completeness
by means of such key concepts as
normalicy and coupledness. This facilitates
some constructive ways of
correcting specifications while
preserving the original semantics of
signal change ordering. The method has
been tested on a large number of
difficult examples, including designing
asynchronous control logic for interfaces
(Unibus, Futurebus, token ring
etc.) and FIFO buffers of various
architectures.

The proposed technique obviously
needs further research, both in
theory (for example, in establishing
restrictions on coupledness classes
to find out how they affect the BF
derivation rules outlined above) and in
practical aspects, through developing
software that turns such a mechanized
translation into a versatile interactive
design environment. Some pieces of
such an environment are in progress now.


Optical Arithmetic Using Signed-Digit
Symbolic Substitution


Abstract A new class of digital arithmetic algorithms is
presented in this paper for supporting massively parallel
computing with state-of-the-art optical technology. We use
a two-dimensional symbolic substitution approach. Signed-digit
(SD) representation is used to enable carry-free
addition/subtraction. Based on SD addition, parallel algorithms
for SD multiplication and division are developed.
The potential advantages of performing digital arithmetic
with optics, as compared with existing electronic arithmetic
computers, include a significant increase in speed, full
exploitation of massive parallelism, higher communication
bandwidth, and higher system throughput. We concentrate
on optical computing using the signed digit set
$\{\bar{1}, 0, 1\}$. The parallel algorithms being presented can be
easily extended to perform optical arithmetic with higher
radices.


1 Introduction 


The signed-digit (SD) representation was originally proposed
by Avizienis [1], and recently introduced to the optical
community by Drake et al. [2]. The binary SD system
uses the digit set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ stands for -1. The
introduction of redundancy (three values for a binary system)
provides a much weaker interdigit dependency as opposed
to the strong dependency manifested by carry propagation
in a nonredundant number system using the digit set {0, 1}.
As a consequence of weak dependency, a carry generated at
any stage is confined within two adjacent digit positions
in the SD code. This makes it possible to perform the
addition/subtraction of any two SD numbers of arbitrary length
in constant time [1,3].


Based on the SD addition, we have developed new algorithms
for SD multiplication and SD division. The multiplication
of two n-digit SD numbers is done in $O(\log_2 n)$ time
by first generating all the n partial products simultaneously
and then adding them in a tree-like fashion. The parallel
generation of all partial products is done in constant time,
independent of the word length n. It is the adder tree that
requires $\log_2 n$ time. The SD division algorithm is generalized
from the quadratic convergence division method [4].
With the provision of high-speed multiplication and parallel
addition, the number of required iterations for SD division
is reduced to $O(\log_2 n)$, where n is the fraction length.
The advantages of optics have been expounded upon on numerous
occasions [5,6]. These include high space-bandwidth
and time-bandwidth products, and inherent parallelism.


2 Symbolic Substitution Technique 


In order to exploit the massive parallelism and ultra-high
speed in optics, Huang [7] introduced a technique called
symbolic substitution (SS) for performing digital arithmetic
optically. In his method, information is represented by optical
patterns within a two-dimensional image. An optical
pattern is a spatial arrangement of dark and bright spots
corresponding to binary values 0 and 1. Computation proceeds
by transforming these patterns into other patterns
according to predefined SS rules. Symbolic substitution
logic is sensitive not only to the values of pixels (picture
elements) carrying information, but also to their spatial
locations in the binary image (image of bright and dark
spots).


In order to implement SD arithmetic optically, we need
an optical encoding for the digit set $\{\bar{1}, 0, 1\}$. There are several
properties of light that can be used. These include light
intensity and light polarization, as illustrated in Fig.1a-b.
Using light intensity, two pixels of different light intensity
are needed to encode the three digits. A possible encoding
scheme is to represent the digit 1 by a bright pixel above
a dark one, the digit $\bar{1}$ by the reversed pixel pattern, and
the digit 0 by two dark pixels, as shown in Fig.1a. Note
that the extra pattern consisting of two bright pixels can
be used as a delimiter to denote the fraction point. Using
light polarization, we need three states of polarization. A
possible encoding scheme would be to represent 1 by vertically
polarized light, $\bar{1}$ by horizontally polarized light, and
0 by light polarized at 45° as shown in Fig.1b. In this paper,
we have chosen to represent the digit set with light
intensity exclusively.
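
The intensity encoding can be sketched as a lookup between digits and two-pixel columns. The assignment of the bright-over-dark pattern to the digit 1 (rather than to -1) is an assumption, as are the names used; 1 stands for a bright pixel and 0 for a dark one.

```python
# Assumed (top, bottom) pixel pairs; 1 = bright, 0 = dark.
PIXELS = {1: (1, 0), -1: (0, 1), 0: (0, 0)}
FRACTION_POINT = (1, 1)      # the unused pattern serves as the delimiter
DIGITS = {pair: digit for digit, pair in PIXELS.items()}


def encode(int_digits, frac_digits):
    """Encode an SD number as a list of two-pixel columns, most significant first."""
    return ([PIXELS[d] for d in int_digits]
            + [FRACTION_POINT]
            + [PIXELS[d] for d in frac_digits])


def decode(columns):
    """Inverse of encode: split at the delimiter and map columns back to digits."""
    point = columns.index(FRACTION_POINT)
    return ([DIGITS[c] for c in columns[:point]],
            [DIGITS[c] for c in columns[point + 1:]])


image = encode([-1], [1, 1, 1])
```

Keeping the delimiter as a fourth pattern means a decoder needs no side information about where the fraction begins.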


Symbolic substitution consists of two phases: a recognition
phase, where the presence of a specific pattern is
detected within a binary image, and a substitution phase,
where the present pattern is replaced by another pattern
according to a predefined SS rule. Optical implementation
of the two SS phases has been investigated by several
researchers [8,9,10].


(a) Light intensity encoding of the digit set $\{\bar{1}, 0, 1\}$


(b) Light polarization encoding of the digit set $\{\bar{1}, 0, 1\}$


Fig.1 Optical encoding of the signed-digit set $\{\bar{1}, 0, 1\}$


3 Optical SD Addition/Subtraction 


Given an SD number $Y = y_{n-1} y_{n-2} \cdots y_0 . y_{-1} \cdots y_{-m}$,
the algebraic value of Y is evaluated as:


$$Y = \sum_{i=-m}^{n-1} y_i \times 2^i, \quad \text{where } y_i \in \{\bar{1}, 0, 1\} \qquad (1)$$


In this number system, there is no need for an explicit
sign digit. In fact, the polarity of the most significant digit
$y_{n-1}$ determines the sign of Y. Although the representation
of an SD number is not unique, the zero (0) is uniquely
represented with all zero digits.
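
Eq.1 can be evaluated directly. The sketch below is an illustration with hypothetical names; digits are given most significant first as the integers -1, 0, 1, and exact rationals are used so that fractional values are not rounded.

```python
from fractions import Fraction


def sd_value(int_digits, frac_digits=()):
    """Algebraic value of an SD number according to Eq.1."""
    value = Fraction(0)
    n = len(int_digits)
    for k, d in enumerate(int_digits):
        value += d * 2 ** (n - 1 - k)          # weights 2^(n-1) ... 2^0
    for k, d in enumerate(frac_digits, start=1):
        value += Fraction(d, 2 ** k)           # weights 2^-1 ... 2^-m
    return value


# Two different digit strings with the same value show the redundancy.
a = sd_value([0], [1, 0, -1])   # 0.5 - 0.125
b = sd_value([0], [0, 1, 1])    # 0.25 + 0.125
```

Both strings evaluate to 3/8, which illustrates why the representation of a nonzero SD number is not unique.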


The addition of two SD numbers represented as
$X = x_{n-1} \cdots x_0 . x_{-1} x_{-2} \ldots x_{-m}$ and $Y = y_{n-1} \cdots y_0 . y_{-1} y_{-2} \ldots y_{-m}$
results in an SD number $S = s_n s_{n-1} \cdots s_0 . s_{-1} s_{-2} \ldots s_{-m}$.
Avizienis has defined three pipelined steps to perform the
SD addition [3]. At the first step, $x_i + y_i = 2t_{i+1} + w_i$ is
performed at the i-th digit position, for $i = -m, \ldots, n-1$,
where $w_i$ and $t_{i+1}$ are called the interim sum digit and the
transfer digit respectively. These digits assume the following
values:


$$
w_i = \begin{cases} \bar{1} & \text{if } x_i + y_i = 1 \\ 0 & \text{if } |x_i + y_i| \neq 1 \\ 1 & \text{if } x_i + y_i = -1 \end{cases}
\qquad
t_{i+1} = \begin{cases} 1 & \text{if } x_i + y_i \geq 1 \\ 0 & \text{if } x_i + y_i = 0 \\ \bar{1} & \text{if } x_i + y_i \leq -1 \end{cases}
\qquad (2)
$$


At the second step, $w_i + t_i = 2t'_{i+1} + w'_i$ is performed to
produce another pair of digits, $w'_i$ and $t'_{i+1}$:


$$
w'_i = \begin{cases} 1 & \text{if } w_i + t_i = 1 \\ 0 & \text{if } |w_i + t_i| \neq 1 \\ \bar{1} & \text{if } w_i + t_i = -1 \end{cases}
\qquad
t'_{i+1} = \begin{cases} 1 & \text{if } w_i + t_i = 2 \\ 0 & \text{if } |w_i + t_i| \neq 2 \\ \bar{1} & \text{if } w_i + t_i = -2 \end{cases}
\qquad (3)
$$


The third step generates the final sum digit, $s_i$, as specified
below:


$$
s_i = w'_i + t'_i = \begin{cases} 1 & \text{if } w'_i + t'_i \geq 1 \\ 0 & \text{if } w'_i + t'_i = 0 \\ \bar{1} & \text{if } w'_i + t'_i \leq -1 \end{cases}
\qquad (4)
$$


Figure 2 shows a totally parallel adder constructed from three
types of optically implemented logic cells (I, II, III), whose
truth-table specifications are given in Table 1. There is
no carry propagation beyond any two adjacent digits in
the adder. Each sum digit $s_i$ depends on only six digits,
$(x_i, y_i)$, $(x_{i-1}, y_{i-1})$, and $(x_{i-2}, y_{i-2})$. Zeros are padded at
the second and the third stages to preserve the same
input/output format at each stage.


Fig.2 A totally parallel optical adder with 3 pipeline stages


Example 1 below illustrates the addition of two SD
numbers, $X = (-0.125)_{10} = (\bar{1}.111)_{SD}$ and $Y = (0.375)_{10} =
(0.10\bar{1})_{SD}$, using the same 3-stage adder shown in Fig.2.
The result is an SD number $S = (0.25)_{10} = (000.1\bar{1}0)_{SD}$.
In this example, $\phi$ represents a padded zero.


Signed-digit subtraction is performed by first negating
the nonzero digits of the subtrahend and then performing
the addition of the two operands. Since the negation operation
can be done in parallel for all digits, subtracting two
SD numbers can also be done in parallel across all digits.


Example 1 (SD Addition)


$X = (\bar{1}01\bar{1})_{SD} = (-7)_{10}$
$Y = (010\bar{1})_{SD} = (3)_{10}$

Stage 1: $w_i$: $\phi\ 1\ \bar{1}\ \bar{1}\ 0$; $t_{i+1}$: $\bar{1}\ 1\ 1\ \bar{1}\ \phi$
Stage 2: $w'_i$: $\phi\ \bar{1}\ 0\ 0\ 0\ 0$; $t'_{i+1}$: $0\ 1\ 0\ \bar{1}\ 0\ \phi$
Stage 3: $s_i$: $0\ 0\ 0\ \bar{1}\ 0\ 0$

$Z = (00\bar{1}00)_{SD} = (-4)_{10}$


Using the truth tables in Table 1, we derive below a set
of SS rules required for optical implementation of the SD
addition. The search patterns of these rules correspond to
the input combinations and the replacement patterns are
the truth table entries as shown in Fig.3. Note that for
Cell Type I and II, the replacement patterns are spatially
displaced by one digit position, which accounts for the fact
that the transfer digits ($t_{i+1}$ and $t'_{i+1}$, respectively) are to be
combined with the next higher-order digit in the addition
process.

On the surface, it seems that we need $3^3 = 27$ SS rules,
corresponding to the nine entries of each of the 3 truth
tables. However, a closer look at Table 1 reveals that the
logic for the first and the second stages is very similar.
Furthermore, if we pad the third stage output with 0, five
of the nine entries become similar to stages 2 and 3. Therefore,
the total number of SS rules needed for SD addition becomes 17.
In fact, when the search pattern is all dark (both operand
digits are 0) the replacement pattern is also all dark, which
does not need any optical processing. Consequently, the
actual number of useful rules for the SD addition becomes
16. The subtraction needs one extra stage to perform the
digit-wise negation. This stage requires two additional SS
rules to negate the nonzero digits as shown in Fig.3d.
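
The negation stage amounts to flipping every nonzero digit independently. A minimal sketch, with hypothetical names and digits given as integers -1, 0, 1 (most significant first):

```python
def sd_negate(digits):
    """Negate an SD number digit by digit; zeros are left unchanged."""
    return [-d for d in digits]


def sd_to_int(digits):
    """Integer value of an SD digit list, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


subtrahend = [0, 1, 0, -1]          # value 3
negated = sd_negate(subtrahend)     # value -3
```

Only the two nonzero digits need a rule, which matches the two additional SS rules mentioned above; subtraction then reuses the adder on the negated operand.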


Table 1 Truth tables of the three cell types (Type I, Type II,
Type III) used in designing the optical adder in Fig.2


To illustrate the use of these SS rules, let us consider 
Example 1 in light of 2-D symbolic substitution. The in- 
put operands are optically encoded and stacked on each 
other as illustrated in Fig.4a. Next, the SS rules for Cell 
type I are applied to the input image. All nine input com- 
binations are searched and then replaced in parallel. This 
results in 3 successive new images as shown in Fig.4b-d, 
corresponding to the outputs of the 3 adder stages. 


(c) Additional rules for Cell Type III


(d) Substitution rules for signed-digit negation


Fig.3 Optical symbolic substitution rules for signed-digit
addition and negation


In Fig.5 we show a schematic block diagram for an optical
digital adder using the signed-digit symbolic substitution
technique. Note that 17 rules are used. The optical
implementation of each substitution rule is detailed in [10].
There are other methods that have been reported to implement
the SD addition optically [11,12].


4 Optical SD Multiplication


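The optical multiplication of two SD numbers
$X = x_{n-1} \cdots x_0 . x_{-1} x_{-2} \cdots x_{-m}$ and $Y = y_{n-1} \cdots y_0 . y_{-1} y_{-2} \cdots y_{-m}$
produces an SD product
$P = p_{2n-1} p_{2n-2} \cdots p_0 . p_{-1} p_{-2} \cdots p_{-2m+1} p_{-2m}$, expressed as:


$$P = (y_{n-1} * X) \times 2^{n-1} + \cdots + (y_{-m} * X) \times 2^{-m} \qquad (5)$$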


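where $y_i$ is the i-th multiplier digit, and * is the signed
AND operation defined as follows for any $x, y \in \{\bar{1}, 0, 1\}$:


$$
x * y = \begin{cases} 1 & \text{if } (x = y = 1) \vee (x = y = \bar{1}) \\ 0 & \text{if } (x = 0) \vee (y = 0) \\ \bar{1} & \text{if } (x = 1 \wedge y = \bar{1}) \vee (x = \bar{1} \wedge y = 1) \end{cases}
\qquad (6)
$$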


Fig.4 An SD addition example showing the use
of symbolic substitution rules: (a) the stacked input operands
$X = \bar{1}.111$ and $Y = 0.10\bar{1}$; (b) output after applying the SS
rules of Cell I; (c) output after applying the SS rules of Cell II;
(d) output after applying the SS rules of Cell III (the desired
sum S)


Fig.5 An optical adder (subtracter) using symbolic
substitution (blocks: input image, substitution rules 1 to 19,
replacement, combined image, optical feedback)

The notations $\vee$ and $\wedge$ are used to represent the conventional
logical OR and logical AND operations. The
notation $y_j * X$ defines the following digit-wise operations:


$$y_j * X = y_j * x_{n-1},\; y_j * x_{n-2},\; \ldots,\; y_j * x_{-m} \qquad (7)$$


We have previously developed a sequential algorithm for
computing the product P in n + m iterations using SD
additions and right shifts [13]. In what follows, we present
a parallel algorithm that computes the product of two SD
numbers in $\log_2(n+m)$ iterations, where (n+m) is the word
length including n integer digits and m fraction digits. For
clarity, we use integer numbers where the fractional length
m = 0. The algorithm is composed of three steps:


Step 1: Given two signed n-digit numbers, generate all
n partial products concurrently, each having length n, as
follows:


$$P_{0,j} = y_j * X \quad \text{for } j = 0, \ldots, n-1 \qquad (8)$$


where the term $P_{0,j}$ is an n-digit SD number representing
the j-th partial product.


Step 2: Introduce the necessary shifts for each partial
product. Each initial partial product $P_{0,j}$ will be shifted
j digits to the left, corresponding to the weight factor $2^j$
shown in Eq.5:


$$P_{0,j} = (y_j * X) \times 2^j \quad \text{for } j = 0, \ldots, n-1 \qquad (9)$$


Step 3: Pairwise add all the partial products by means
of an adder tree. With a total of n partial products at the
leaves of the tree, the summation process takes $\log_2 n$ levels
in the tree. At each level i, we perform $n/2^i$ SD additions
in parallel:


$$P_{i,j} = P_{i-1,2j-2} + P_{i-1,2j-1} \quad \text{for } j = 1, 2, \ldots, n/2^i \qquad (10)$$


The final product is produced at the root of the tree after
$\log_2 n$ iterations. Step 1 and Step 2 are carried out in
constant time. For a multiplier of length n, Step 3 requires
$\log_2 n$ iterations. Since each SD addition takes constant
time, the multiplication of two n-digit SD numbers
can be carried out in $O(\log_2 n)$ time.
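
The three multiplication steps can be sketched for integer operands as follows. This is a software illustration, not the optical procedure: digits are integers -1, 0, 1 with the most significant digit first, the signed AND is modelled as the ordinary digit product, and the compact `sd_add` helper and all names are assumptions of the sketch.

```python
def sd_add(x, y):
    """Three-step SD addition; returns len(x) + 2 digits, most significant first."""
    a = [p + q for p, q in zip(x[::-1], y[::-1])] + [0, 0]
    t = [0] + [(s > 0) - (s < 0) for s in a]
    w = [s - 2 * c for s, c in zip(a, t[1:])]
    b = [p + q for p, q in zip(w, t)]
    t2 = [0] + [(s == 2) - (s == -2) for s in b]
    w2 = [s - 2 * c for s, c in zip(b, t2[1:])]
    return [p + q for p, q in zip(w2, t2)][::-1]


def sd_multiply(x, y):
    """Multiply two n-digit SD integers with an adder tree."""
    n = len(x)
    partials = []
    for j in range(n):                        # Steps 1 and 2, one row per digit y_j
        yj = y[n - 1 - j]
        row = [yj * d for d in x]             # signed AND of y_j with every x digit
        partials.append([0] * (n - j) + row + [0] * j)   # shift left by j places
    while len(partials) > 1:                  # Step 3: pairwise additions per level
        if len(partials) % 2:
            partials.append([0] * len(partials[0]))
        partials = [sd_add(partials[k], partials[k + 1])
                    for k in range(0, len(partials), 2)]
    return partials[0]


def sd_to_int(digits):
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


product_digits = sd_multiply([1, 0, -1, -1], [1, 0, -1, 1])   # 5 * 7
```

The `while` loop runs once per tree level, so for n = 4 it performs two rounds of additions; each round could run its additions side by side.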


Example 2 below shows the parallel multiplication of
two 4-digit SD numbers, $X = (1.0\bar{1}\bar{1})_{SD} = (0.625)_{10}$ and
$Y = (1.0\bar{1}1)_{SD} = (0.875)_{10}$. In Step 1, we generate all
the partial products using Eq.8. In Step 2, we introduce
the necessary shifts. Finally, we add all the shifted partial
products according to Eq.10, using a tree of SD adders to
produce the final product $P = (000.10010\bar{1})_{SD} = (0.546875)_{10}$.


Example 2:
Step One: Generation of the partial products
$P_{0,0} = y_{-3} * X = 1.0\bar{1}\bar{1}$
$P_{0,1} = y_{-2} * X = \bar{1}.011$
$P_{0,2} = y_{-1} * X = 0.000$
$P_{0,3} = y_0 * X = 1.0\bar{1}\bar{1}$


Step Two: Shift the partial products
$(y_{-3} * X) \times 2^0 = 00010\bar{1}\bar{1}$
$(y_{-2} * X) \times 2^1 = 00\bar{1}0110$
$(y_{-1} * X) \times 2^2 = 0000000$
$(y_0 * X) \times 2^3 = 10\bar{1}\bar{1}000$


Step Three: Summation of all the shifted partial products

$00010\bar{1}\bar{1} + 00\bar{1}0110 \rightarrow 00000\bar{1}\bar{1}1$
$0000000 + 10\bar{1}\bar{1}000 \rightarrow 01\bar{1}01000$
$00000\bar{1}\bar{1}1 + 01\bar{1}01000 \rightarrow 000.10010\bar{1}$

$X \times Y = Z = (000.10010\bar{1})_{SD} = (0.546875)_{10}$


The SD multiplication algorithm uses the signed AND
operation (*) in generating all partial products simultaneously,
and a tree of SD adders to sum them up. Using
Eq.6, we derive the SS rules needed for implementing the
* operation as shown in Fig.6. Let us consider the optical
implementation of the computations in Example 2.
The multiplicand and multiplier are arranged in 1-D arrays
as shown at the left of Fig.7. The multiplicand is
shown horizontally and the multiplier is shown vertically.
The generation of all partial products $P_{0,j}$ for j = 0,...,3
is carried out in three stages. First, the multiplicand is
spread out vertically by the astigmatic optics (represented
by the cylindrical lens L1) to fill the 4 x 4 data plane M1.
Similarly, the multiplier is spread out horizontally using
the cylindrical lens L2, so that each digit of the multiplier
is duplicated 4 times to fill the 4 x 4 plane M2.
Next, planes M1 and M2 are 2-D perfect shuffled [10] and
then stored in an 8 x 4 plane R. For clarity, the optics required
for the 2-D perfect shuffle permutations is omitted
from Fig.7. The 2-D shuffle permutations intended here
affect only the row position, leaving the column position
of the data unchanged.


The resulting image, R, has alternating rows from M1
and M2 such that odd rows contain the multiplicand and
even rows contain a replicated digit of the multiplier. Therefore,
row 1, row 3, row 5, ..., row n - 1 contain the multiplicand
X; and row 2, row 4, row 6, ..., row n contain
the replicated digits $y_0, y_1, y_2, \ldots, y_{n-1}$ of the multiplier
respectively. In the third stage, plane R is replicated 9 times,
and each copy is used for applying one SS rule of the * operation.
Therefore, every combination of the input operands
is searched and replaced in parallel. Finally, the output
planes of all the SS rules applied are optically superimposed.
At this point, all the partial products have been
generated in parallel as shown in plane P of Fig.8. Step 2
of the SD multiplication algorithm involves spatial shifts.
There are a variety of ways one can perform spatial shifts
in optics [14].


Fig.6 Symbolic substitution rules for the SD AND
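
The spreading, row shuffle, and rule application can be mimicked with nested lists. The sketch below is only a data-layout illustration under stated assumptions: planes are lists of rows, digits are integers -1, 0, 1, the signed AND is modelled as the digit product, and the function names are hypothetical.

```python
def spread_and_shuffle(x, y):
    """Build plane R: multiplicand rows interleaved with replicated multiplier digits."""
    m1 = [list(x) for _ in y]             # M1: the multiplicand copied into every row
    m2 = [[yj] * len(x) for yj in y]      # M2: each multiplier digit fills one row
    r = []
    for row_x, row_y in zip(m1, m2):      # row-wise perfect shuffle of M1 and M2
        r.append(row_x)
        r.append(row_y)
    return r


def partial_products(r):
    """Apply the signed AND to each (multiplicand row, multiplier row) pair of R."""
    return [[a * b for a, b in zip(r[k], r[k + 1])] for k in range(0, len(r), 2)]


plane_r = spread_and_shuffle([1, 0, -1, -1], [1, 0, -1, 1])
plane_p = partial_products(plane_r)
```

Each entry of `plane_p` depends only on the two pixels stacked above each other in R, which is why one search-and-replace pass per rule is enough.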


Fig.7 The spreading of the operands in Example 2
for parallel SD multiplication


The plane P, consisting of all partial products with appropriate
shifts, is then fed to the adder described in the
previous section in order to perform the last step of the
multiplication algorithm. This is accomplished by applying
the SD addition rules for $\log_2 4$ iterations.


In general, with a multiplicand of length n and a multiplier
of length m, the planes M1, M2, and P in Fig.7 are
all m x n arrays, R is a 2m x n array, and the shifted P is an
m x (m+n) array. It should be noted that, if the 1-D arrays
which are used to input the operands are replaced by 2-D
arrays and associated optics for spreading and shuffling,
many operand pairs can be multiplied in parallel using the
same set of SS rules.


Fig.8 Parallel generation of partial products using
the SS rules for the SD AND operation


5 Optical SD Division 


The conventional restoring and nonrestoring division 
methods require knowledge of the sign of the partial re- 
mainder for exact selection of the quotient digits. How- 
ever, in SD representation, the sign of a partial remainder 
is not readily available if several most significant digits 
are zero. This difficulty prevents the use of conventional 
methods for SD division. Robertson's division method [15]
was applied in [1] for SD number systems with radix r > 3.
In that method, the quotient is represented in redundant 
form and the value of the next quotient digit is selected 
by comparing approximated values of both the divisor and 


the partial remainder. In searching for an effective divi- 
sion algorithm for SD numbers with radix r = 2, we have 
to achieve the following two goals: 


(1) The algorithm should overcome the difficulty of testing
the polarity of the remainder after each iteration.


(2) The algorithm should make effective use of the 2-D
parallel SD addition and multiplication schemes described
in previous sections.


An SD division algorithm satisfying the above goals is
developed below based on the convergence approach [4,16,17].
Let us consider a dividend X and a divisor Y, both SD fractions
in normalized form, that is:


$$1/2 \leq |X| < Y < 1 \qquad (11)$$


We want to compute the quotient Q = X/Y without a
remainder. The algorithm consists of finding a sequence of
multiply factors $m_0, m_1, m_2, \ldots, m_n$ such that $Y \times (\prod_{i=0}^{n} m_i)$
converges to 1 (within an acceptable error criterion). Initially,
we set $X_0 = X$ and $Y_0 = Y$. The algorithm repeats
the following recursions:


$$X_{i+1} = X_i \times m_i, \qquad Y_{i+1} = Y_i \times m_i \qquad (12)$$

such that for a small n:

$$Y \times \Big(\prod_{i=0}^{n} m_i\Big) \rightarrow 1, \qquad Q = X \times \Big(\prod_{i=0}^{n} m_i\Big) \qquad (13)$$


The effectiveness of this method relies on the ease of
computing the multiply factors $m_i$, using only SD addition
and SD multiplication operations. The recursive formula
of Eq.12 can be rewritten as:

$$Y_{i+1} = Y_i \times m_i = f(Y_i) \tag{14}$$

We desire the function $f(Y_i)$ to converge to 1, starting from
an initial value $Y_0 = Y$. Equation 14 can be rewritten in
a polynomial form:

$$f(Y_i) - Y_i = 0 \tag{15}$$

Flynn has described several iterative methods [16] to
enable such a polynomial to converge to a given value, say k.


We are interested only in the quadratic convergence as this
appears more convenient for optical realization. To achieve
this, let us rewrite Eq.15 in a more general quadratic form:

$$f(Y_i) - Y_i = (Y_i - k_1)(Y_i - k_2) = 0 \tag{16}$$


One of the roots of Eq.16 should be equal to the
convergence limit 1. Krishnamurthy [17] has found that in order
for $Y_i \times m_i$ to converge quadratically to 1, the factors $m_i$
should be selected as:

$$m_i = 2 - Y_i \quad \text{provided that } 0 < Y_i < 2. \tag{17}$$


Equation 17 implies that the multiply factor for each
iteration can be easily obtained as the two's complement
of the denominator $Y_i$. In SD code, the arithmetic
expression $2 - Y_i$ can be computed in constant time using
SS rules for SD negation and addition. The convergence is
quadratic: if $Y_i = 1 - \epsilon$, then
$Y_{i+1} = Y_i (2 - Y_i) = 1 - \epsilon^2$. The accumulated
denominator length is therefore doubled after each iteration.
Hence for a desired quotient of length n, the maximum number
of iterations needed is $\log_2 n$. The convergence division
of two SD numbers is formally specified below:


SD Division Algorithm

begin
    for i := 0 to log_2 n − 1 do
        m_i := 2 − Y_i;
        X_{i+1} := X_i × m_i;
        Y_{i+1} := Y_i × m_i;
    endfor;
    Q := X_{log_2 n − 1};
end.


Example 3 illustrates the SD convergence division of
$X = (0.\bar{1}0)_{SD} = (-0.5)_{10}$ by $Y = (0.11)_{SD} = (0.75)_{10}$.
For a 16-digit precision, the algorithm generates the
quotient after 3 iterations, Q = (1.111011101111111)_SD =
(−0.66664)_10. As for the optical implementation of the
algorithm, each iteration of the SD division consists of
three major operations: two SD multiplications,
$Y_{i+1} = Y_i \times m_i$ and $X_{i+1} = X_i \times m_i$, and a
two's complement operation $m_i = 2 - Y_i$. Each SD
multiplication can be optically carried out as described in
Section 4. The two's complement is carried out by an SD
negation followed by an SD addition. The subtrahend $Y_i$ is
negated using the SS rules in Fig.3d. All nonzero digits of
$Y_i$ are negated in parallel. The expression $2 - Y_i$ then
becomes $2 + \bar{Y}_i$, where $\bar{Y}_i$ denotes the negated
operand, which is computed using the SD addition rules in
Fig.3a-c. The two SD multiplications required to generate
$X_{i+1}$ and $Y_{i+1}$ can be computed concurrently by
replicating the SD multiplication hardware into two channels,
one for the numerator and one for the denominator as shown in
Fig.9.
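The recursion can be checked numerically with a short sketch. It works on exact rational values rather than on SD digit strings, so it illustrates only the arithmetic of the convergence method, not the optical symbolic substitution; the function name `convergence_divide` and its argument names are illustrative and do not come from the paper.

```python
from fractions import Fraction


def convergence_divide(x, y, iterations):
    """Approximate x / y by repeated multiplication with m_i = 2 - Y_i.

    Returns the final numerator (the quotient estimate) and a list of
    (m_i, Y_{i+1}, X_{i+1}) tuples, one per iteration.
    Exact rationals are used instead of SD digits (an assumption made
    purely to keep the sketch short).
    """
    x, y = Fraction(x), Fraction(y)
    if not 0 < y < 2:
        raise ValueError("the divisor must satisfy 0 < Y < 2")
    steps = []
    for _ in range(iterations):
        m = 2 - y      # multiply factor: two's complement of Y_i
        x = x * m      # accumulated numerator
        y = y * m      # accumulated denominator, converges to 1
        steps.append((m, y, x))
    return x, steps


# Operands of Example 3: X = -0.5, Y = 0.75, three iterations.
quotient, steps = convergence_divide(Fraction(-1, 2), Fraction(3, 4), 3)
for i, (m, y, x) in enumerate(steps):
    print(i, float(m), float(y), float(x))
```

With these operands the accumulated denominator goes through 0.9375 and 0.99609375 and ends within 2^-16 of 1, while the accumulated numerator approaches −2/3.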


6 Performance Analysis 


We estimate the potential speed of the optical arith- 
metic algorithms introduced in this paper. The analysis 
is based on the optical implementation models presented 
in previous sections. These estimates should reflect the 
state-of-the-art in optical computing technology. Our esti- 
mates cover both conservative and optimistic sides of the 
expected performance. 


The SD addition is performed in three stages. The to- 
tal time to perform each stage is attributed to the time 
needed : (1) to replicate the input image; (2) to propagate 
the image through the first hologram to provide the shifts; 
(3) to activate the optical NOR-gate array for inverting 
the superimposed image; (4) to propagate light through 
the second hologram for substitution; (5) to superimpose 
the output of all the rules; and (6) to feed back the in- 
termediate result. Therefore the total SD addition time is 
expressed as: 


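$$T_{add} = 3\,(\overbrace{T_p}^{(1)} + \overbrace{T_p}^{(2)} + \overbrace{T_{activ}}^{(3)} + \overbrace{T_p}^{(4)} + \overbrace{T_p}^{(5)}) + \overbrace{2 \cdot T_f}^{(6)} \tag{18}$$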


where: 


T_p = Propagation time of a light beam through passive
optical devices such as lenses, beam splitters,
holograms, etc.

T_f = Feedback time (light propagation through the
feedback interconnect)

T_activ = Response time of an optical NOR-gate array used
for inversion and thresholding.


Example 3: SD division steps based on repeated multiplications

Iteration step | Multiply factor | Accumulated denominator | Accumulated numerator
i = 0 | m_0 = 2 − Y_0 = (1.25)_10 | Y_1 = Y_0 × m_0 = (0.9375)_10 | X_1 = X_0 × m_0 = (−0.625)_10
i = 1 | m_1 = 2 − Y_0 × m_0 = (1.0011)_SD = (1.0625)_10 | Y_2 = Y_0 × m_0 × m_1 = (1.0000001)_SD = (0.99609)_10 | X_2 = X_0 × m_0 × m_1 = (1.11101110)_SD = (−0.66406)_10
i = 2 | m_2 = 2 − Y_0 × m_0 × m_1 = (1.0000001)_SD = (1.00390625)_10 | Y_3 = Y_0 × m_0 × m_1 × m_2 = (1.0000000000000001)_SD, Y_3 → 1 | X_3 = X_0 × m_0 × m_1 × m_2, Q = (1.111011101111111)_SD = (−0.6666..)_10


[Figure labels: SS rules for negation; SS rules for addition; N-channel for numerator multiplication (X_{i+1}); D-channel for denominator multiplication (Y_{i+1})]

Fig.9 An optical convergence divider using two channels of optical multipliers


The numbers over the braces in Eq.(18) indicate the
times needed to accomplish each subtask. T_p and T_f can be
approximated by 0.1 nsec [14] (light propagates at 1 ft/nsec
in free space). The dominant limitation to speed is the
switching time of the optical NOR-gate array, representing
the only active element in the addition path. Therefore,
the total SD addition time would be $T_{add} \approx 3T_{activ}$. An
n-digit SD addition requires (n + 1) × 4 pixels, where the
factor 4 is introduced by the encoding scheme used (2 light
pixels for each digit). Therefore, for an optical gate array
of size l × l pixels and a switching time τ, the optical SD
adder is able to perform $\theta_a$ n-digit additions per second,
where:

$$\theta_a = \frac{l \times l}{3\tau \times ((n+1) \times 4)} \quad \text{(SD additions/sec)} \tag{19}$$


Optical gate arrays of very small sizes (say 2 × 2 to
5 × 5) have been recently demonstrated [18]. These arrays
offer the possibility of achieving a 10^-12 sec switching
time. However, these optical gate arrays cannot be
used in a practical system due to their small size and high
power consumption. If we were to use a commercial spatial
light modulator (SLM) such as the liquid crystal light
valve (LCLV) with a 500 × 500 pixel resolution and 20 ms
switching time, we can perform about 63 × 10^3 32-digit SD
additions per second. This yields an average of
$1/\theta_a$ = 15 × 10^-6 sec per SD addition. This speed is not
much faster than today's fast adders. However, faster SLMs are
being produced in research laboratories [18]. If the response
time of the SLM were reduced to 0.01 µsec, a 500 × 500
resolution will bring the 32-digit SD addition time down
to 1.5 × 10^-13 sec (0.15 ps), which will represent a 10^4 times
improvement over electronic adders of the same size.


Referring to the optical implementation model in Sec.4,
the SD multiplication time is attributed to the time needed:
(1) to generate the partial products; (2) to shift them;
and (3) to add up the shifted partial products. This time
is expressed as:

$$T_{mult} = \overbrace{T_s + 4T_p + T_{activ}}^{(1)} + \overbrace{T_p}^{(2)} + \overbrace{T_{add} \times \log_2 n}^{(3)} \tag{20}$$


where $T_s$ represents the time needed to spread and to
shuffle the operands. This time corresponds to light
propagation through passive devices which can be estimated by
0.1 nsec. Since $T_s \approx T_p \ll T_{activ}$ and $T_{add} \approx 3T_{activ}$,
we have $T_{mult} \approx T_{activ}(1 + 3\log_2 n)$, where n is the precision
of the multiplier. An n-digit SD multiplication requires
4 × (n × 2n) pixels, where the factor 4 is related to the
light encoding of the digit set $\{\bar{1}, 0, 1\}$. Using an SLM
with l × l pixel resolution and τ switching time, we obtain
the number $\theta_m$ of n-digit multiplications performable per
second:

$$\theta_m = \frac{l \times l}{4 \times (n \times 2n) \times \tau \times (1 + 3\log_2 n)} \tag{21}$$


If we were to use a standard off-the-shelf SLM (LCLV),
there could be 96 SD multiplications per second. This
corresponds to a speed of $1/\theta_m$ = 10 msec per 32-digit
SD multiplication. This looks very slow. However, if the
switching time of the SLM were reduced to 0.01 µsec, the
32-digit SD multiplication time would be reduced to 5 nsec,
which is 100 times faster than today's fastest electronic
multipliers of the same word length.


Consider the optical implementation shown in Fig.9.
The time required to perform one iteration of the SD
division consists of the time needed: (1) to generate the
multiplicative factor $m_i$; and (2) to produce the next
numerator and denominator $X_{i+1}, Y_{i+1}$. This time is then
multiplied by the logarithm of the fraction length to
obtain the total SD division time $T_{div}$:

$$T_{div} = (\overbrace{4T_p + T_{activ} + T_{add} + T_f}^{(1)} + \overbrace{T_{mult} + T_f}^{(2)}) \times \log_2 n \tag{22}$$


Substituting $T_{add}$ and $T_{mult}$ in Eq.22 with Eq.18 and Eq.20
respectively, we obtain $T_{div} \approx T_{activ} \log_2 n\,(5 + 3\log_2 n)$. An
important feature of the SD division algorithm is that
several dividends can be divided simultaneously by the same
divisor. This is due to the fact that the multiply factors
and the convergence rate depend only on the magnitude of
the divisor. An n-digit SD division requires 4 × (n × 2n)
pixels to hold the accumulated numerators or denominators
(assuming that we are truncating the intermediate
products by n digits after each iteration). Therefore, for
an optical gate array of l × l resolution and τ switching
time, we estimate the number of SD divisions per second
as:

$$\theta_d = \frac{l \times l}{4 \times (n \times 2n) \times \tau \times \log_2 n\,(5 + 3\log_2 n)} \tag{23}$$


For a resolution l × l = 500 × 500 and a switching time
τ = 0.01 µsec, the time needed for a 32-digit SD division
would be $1/\theta_d$, which is around 30 nsec, a rather impressive
figure that no existing electronic divider can achieve.
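The multiplication and division estimates can be reproduced with a few lines of arithmetic. The sketch below assumes that Eq.21 and Eq.23 both carry the pixel count 4 × (n × 2n) in the denominator and that the faster switching time is 10 nsec, which is the reading consistent with the 5 nsec and 30 nsec figures quoted above; the function names are illustrative only.

```python
import math


def mult_per_second(l, tau, n):
    """n-digit SD multiplications per second on an l x l array (Eq.21 as read here)."""
    pixels = 4 * (n * 2 * n)
    return (l * l) / (pixels * tau * (1 + 3 * math.log2(n)))


def div_per_second(l, tau, n):
    """n-digit SD divisions per second on an l x l array (Eq.23 as read here)."""
    pixels = 4 * (n * 2 * n)
    log_n = math.log2(n)
    return (l * l) / (pixels * tau * log_n * (5 + 3 * log_n))


# 500 x 500 array, 32-digit operands.
slow = 20e-3   # LCLV switching time of 20 ms
fast = 10e-9   # assumed faster SLM switching time of 10 nsec

print(mult_per_second(500, slow, 32))      # multiplications per second with the LCLV
print(1 / mult_per_second(500, fast, 32))  # seconds per multiplication with the fast SLM
print(1 / div_per_second(500, fast, 32))   # seconds per division with the fast SLM
```

Under these assumptions the LCLV gives roughly 95 multiplications per second, and the faster device gives about 5.2 nsec per multiplication and about 33 nsec per division, in line with the rounded values in the text.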


In Fig.10a, we plotted the optical addition, multiplication
and division times against a wide range of the optical
clock rate (the inverse of the optical switching time τ).
The speedup of the optical arithmetic operations over their
electronic counterparts is plotted in Fig.10b. We fixed the
resolution of the optical gate arrays to l × l = 500 × 500, and
the precision to n = 32 SD digits. For the speedup curves, we
used 20 nsec, 500 nsec, and 2 µsec for 32-bit electronic
addition, multiplication and division, based on current
electronic technology [19,3]. Both scales are logarithmic
(base 10).


[Figure: compute time (nsec) versus clock rate 1/τ from 10 MHz to 100 GHz, with curves for addition time, multiplication time and division time]

Fig.10a Optical compute time as a function of the clock rate (1/τ)

Fig.10b Potential speedup of optical over electronic arithmetic computations.


7 Conclusions 


The SD representation allows parallel addition to be
performed in constant time. The execution times of the
proposed SD multiplication and SD division algorithms are
both proportional to $\log_2 n$, where n is the length of the
multiplier and of the divisor. We have presented the
optical setups to achieve 2-D optical symbolic substitution.
The carry-free nature of SD arithmetic matches well with
the space-invariant property of optical symbolic
substitution.


We have introduced two new sets of SS rules for im- 
plementing SD arithmetic in optics. The optical imple- 
mentations are based on available optical hardware. We 
have assessed the performance of optical arithmetic based 
on the state-of-the-art optical and electro-optical technolo- 
gies. We conclude that the speedup over electronic coun- 
terparts is rather limited due to the slow switching time of 
today’s 2-D spatial light modulators. 


If the switching time of the optical gate arrays were re- 
duced to nanosecond range, we could perform 32-digit op- 
tical addition, multiplication and division with a speedup 
ranging from O(107) to O(10°) over existing electronic coun- 
terparts as shown in Fig.10b. Therefore, the potential 
of building future supercomputers with optical arithmetic 
units looks very promising and encouraging. The algo- 
rithms developed in this paper are meant to prepare com- 
puter designers for the new challenges brought over by op- 
tical technology. 
ABSTRACT


Due to its tendency towards large and unpredictable 
amounts of interprocessor communication, parallel 
logic simulation places enormous demands on the 
performance of both individual processor elements 
and interprocessor communications. To explore the 
relative importance of processor and communications 
speed and to compare the merits of different architec- 
tures for this application, results are offered from the 
simulation of a number of test circuits on models of 
five parallel architectures. Three different schemes 
are used to partition the circuit representations across
processors, and both 4 and 16 processor
configurations are considered for each architecture.
The relative cost of device evaluation and signal com- 
munication is also varied. The five architectures 
examined are: a parallel processor with a single inter- 
processor communications bus, a ring of processors, a 
simple processor array with nearest-neighbor connec- 
tions, a hypercube, and a processor array with 
crossbar communications. The results are compared 
both to the single processor case and to the ideal 
parallel case, and they indicate that the performance 
potential of parallel event driven logic simulation at 
this level is questionable. 


INTRODUCTION 


The march towards ever larger and faster computer 
systems has continually outpaced the rate of advances in the 
computer aided design (CAD) tools used to develop them.
Nevertheless, CAD tool developers struggle to keep up by 
employing combinations of three basic tactics. The first tactic 
seeks to reduce the size of the problem, either through hierarchi- 
cal modeling of computer systems or by considering only small 
portions of the design at a time. The second tactic centers on 
the discovery of more efficient algorithms. And the third tactic 
involves the exploitation of parallel architectures. 


Approaches to parallel logic simulation can generally 
be divided into two categories. Those in the first category pur- 
sue speed gains by breaking the algorithm into pieces that are 
then executed on separate processors. Although parallel logic 
simulators of this variety have been successfully implemented in 
hardware [1], their performance potential would appear to be too
severely restricted by the limited parallelism inherent in tradi- 
tional simulation algorithms to be of lasting interest. 
Approaches in the second category attempt to leverage the paral- 
lelism evident in the behavior of real circuits by partitioning the 
circuit representation being simulated among several processors. 


At first blush, it seems reasonable to speculate that 
the performance of parallel logic simulators based on circuit par- 
titioning need not degrade markedly as the size of circuit 
representations increases. Two factors invalidate this speculation.
First is the fact that in traditional event driven logic simulators,
it is necessary to maintain the same simulation time
across all processors, requiring all processors to complete their 
work at a given time unit before any may proceed to the next 
time at which there is activity. The second factor is the time 
expense incurred by communicating signal values for devices 
modeled on one processor needed as input to devices on another 
processor. 


The goal was to explore the potential of parallel logic 
simulation based on circuit partitioning in light of these con- 
siderations. To aid in comparative analysis, an instrumented 
event driven simulator was employed to model performance for 
five different parallel architectures using a set of 11 test circuits. 
The five architectures examined were: a parallel processor with 
a single interprocessor communications bus, a ring of processors, 
a simple processor array with nearest-neighbor connections, a 
hypercube, and a processor array with crossbar communications. 
Three circuit partitioning algorithms were tried in each case to 
examine the sensitivity of simulator performance to this task. 
And finally, four different sets of device evaluation time to sig- 
nal transmission time ratios were used to determine the relative 
performance criticality of these parameters. 


PARALLEL LOGIC SIMULATION 


In this section, we define the parallel logic simulation 
algorithm to be used in the analysis below. As mentioned ear- 
lier, parallel techniques can generally be divided into those that 
distribute the algorithm among several processors and those that 
distribute the data. The limitations on useful decomposition of 
the basic logic simulation algorithm render approaches in the
first category unworkable for large scale parallelism. Therefore, 
the discussion at hand will be restricted to approaches that distri- 
bute the circuit representation among processors. 


Circuit models used in logic simulators are typically 
composed of structure instances that represent individual device 
occurrences with electrical connectivity indicated via pointers 
from structure to structure. This representation is split among 
processors for parallel logic simulation using one of the parti- 
tioning algorithms presented in the following section. Once 
every device structure (and hence every device model) is 
assigned to a processor, the circuit representation is loaded onto 
the appropriate element. Devices that drive inputs not found on 
the same processor are tagged so that when their outputs change, 
sink devices on other processors can be notified of the change 
via a signal change message. 


Assuming a standard one-pass simulation algorithm, it
is possible to develop simple equations for determining
performance. For simulation on a single processor, the time for a
simulation run is given by

$$t_{sim} = \sum_{i=1}^{sim\_passes} evals_{pass_i} \times t_{eval} \tag{1}$$

Ignoring the negligible synchronization overhead, the general
equation for simulation on a machine with multiple processor
elements (PEs) is given by

$$t_{psim} = \sum_{i=1}^{sim\_passes} \mathrm{MAX}\{\, eval\_t^{max}_{pass_i},\ com\_t^{max}_{pass_i} \,\} \tag{2}$$
where the maximum communications time is architecture
dependent and will be discussed in due course, and the maximum
evaluations time at each pass is equal to the product of the
largest evaluation count found on the processors and the time per
evaluation. In qualitative terms, Equation 2 states that the time
per simulation pass is equal to the maximum time required by
any of the individual processors and that this time is determined
by the maximum of the communication and evaluation times.
And, of course, the total simulation time is equal to the sum of
the times spent on all of the passes. The single evaluation time
added to the communications time simply indicates that, if the
time consumed by a given processor is dominated by
communication, then after the last signal for the current pass has been
received, a final evaluation must be performed to complete the
pass. Note that for our analysis, all evaluations are assumed to
take the same amount of time. This assumption is valid for most
hardware implementations using simple look-up tables for
evaluations, and is a reasonable approximation in general. With
these equations, we can calculate the speed-up of parallel logic
simulation directly from


$$speed\text{-}up = \frac{t_{sim}}{t_{psim}} \tag{3}$$

Equation 2 will be elaborated for each of the five architecture
models presented below to account for the effect of the different
architectures on communications performance.
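A minimal sketch of this calculation follows, assuming Equations 1 to 3 take the form described in the surrounding text: the single-processor time sums all evaluations, and the parallel time sums, over passes, the larger of the worst per-processor evaluation time and the worst communications time. The function names and the sample evaluation counts are hypothetical and serve only to illustrate the bookkeeping.

```python
def single_processor_time(evals, t_eval):
    """Eq. 1 as read here: every evaluation of every pass runs on one processor.

    evals[i][p] is the number of evaluations processor p performs in pass i.
    """
    return sum(sum(per_pe) * t_eval for per_pe in evals)


def parallel_time(evals, com_times, t_eval):
    """Eq. 2 as read here: each pass costs the larger of the worst evaluation
    time and the worst communications time among the processors.

    com_times[i] is the maximum communications time of pass i, which is
    architecture dependent and supplied by the caller.
    """
    total = 0.0
    for per_pe, com_t in zip(evals, com_times):
        total += max(max(per_pe) * t_eval, com_t)
    return total


def speed_up(evals, com_times, t_eval):
    """Eq. 3: ratio of single-processor time to parallel time."""
    return single_processor_time(evals, t_eval) / parallel_time(evals, com_times, t_eval)


# Hypothetical workload: two passes on four processors.
evals = [[4, 2, 3, 3], [1, 0, 5, 2]]
print(speed_up(evals, [0.0, 0.0], 1.0))  # messages sent in zero time
print(speed_up(evals, [6.0, 3.0], 1.0))  # first pass dominated by communication
```

Even with free communication the speed-up in this toy case stays well below the processor count because the evaluations are unevenly spread, which is the effect the ideal concurrency figures are meant to capture.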


CIRCUIT PARTITIONING 


The goal of circuit partitioning is to assign devices to 
processors in a manner which maximizes the resultant simulation 
speed-up. As shown in Equation 2, this speed-up is dependent 
upon message transmission time and gate evaluation time. The 
optimal partition will produce the minimum number of messages 
as well as an even distribution of evaluations at each processor 
for each simulation pass; however, finding such a partition is a
dramatically more expensive process than the simulation task 
itself. Therefore, two heuristics are used that have been shown 
to produce satisfactory results for a variety of circuits: input 
cone and output cone partitioning. These schemes involve plac- 
ing occurrences into the circuit block to which they have the 
greatest attraction, that is, the block that contains the greatest 
number of occurrences in their input or output cone. In addition 
to the two heuristic schemes, random assignment partitioning is 
also included as a baseline. Since signal activity tends to be 
clustered for many circuits, this method offers the potential for 
relatively high degrees of processing concurrency. However, the 
advantage is frequently offset by large numbers of message 
transfers. 


Clearly, the success of any given partitioning scheme 
is dependent on the relative weighting of communication and 
evaluation costs. Since partitioning by cones seeks to minimize 
interprocessor communications by grouping connected devices 
on the same block, the performance of these approaches hinges 
on the assumption that communications costs dominate simula- 
tion time. 


THE PARALLEL ARCHITECTURES 


In this section, we describe the five architectures 
modeled in our analysis. To aid in relative comparisons, the 
same performance assumptions are made in each case. To wit, it
is assumed that direct point-to-point signal change messages
require a constant amount of time, and that processors do not
buffer these messages. This assumption implies that a message
being sent through a single intermediary processor will require
one message time unit on the originating processor, two units on
the intermediary processor, and one unit on the destination
processor.
Processor elements are assumed to be identical general
purpose machines capable of performing both the device
evaluation and local time-management tasks. The time con- 
sumed during a device evaluation is assumed to be constant and 
includes scheduling overhead so that in the absence of any mes- 
sage traffic, the time per simulation pass on a processor simply 
equals the product of the evaluation count during the pass and 
the time per evaluation. 


Finally, it is assumed that none of the architectures 
possess any global memory. The local memory on each proces- 
sor element contains both a unique portion of the circuit being 
simulated and the simulation time management structures 
required during simulation. Devices that drive signals on other 
processors are tagged with the data needed to route the informa- 
tion to its destination. 


The data presented below were gathered from a 
heavily instrumented event driven logic simulator operating on 
test circuits partitioned during preprocessing. Partition blocks 
were assigned to processor elements randomly. Message traffic 
figures obtained from the simulations are exact and represent a 
complete picture of the expected interprocessor communications 
load for the 11 test cases used in our analysis. In taking this 
approach, it has proven remarkably easy to model new architectures
and to focus on the considerations currently of greatest
interest, namely, the effects of processor count and interconnec- 
tion architectures. 


Single Bus Architecture 


A bus supports the transmission of only one message 
at a time, but arbitration is assumed to occur in parallel and is 
therefore not considered in overall simulation time. From Equa- 
tion 2 it is clear that the time consumed by a processor per simu- 
lation pass is roughly equal to the greater of the communications 
time and the evaluation time. The evaluation time per processor 
is independent of the interconnection architecture in use, but the 
communications time per processor is highly dependent on this 
factor. For the single bus architecture, the communications time 
per pass is given by 


$$com\_t_{pass}^{single\ bus} = \sum_{PEs} msgs_{PE_i} \times t_{com} + 1 \times t_{eval} \tag{4}$$


which states simply that, since all messages are transmitted via 
the same channel, the communications time is determined by the 
total count of interprocessor messages. As mentioned earlier, the 
single evaluation time is added to account for the work required 
after the last signal change message is received. For the single 
bus architecture, the maximum and average path for messages is 
equal to one, but this one channel is likely to be very busy. 


Ring Architecture 


For our ring model, we assumed unidirectional mes- 
sage flow. Assuming that each message flows in a clockwise 
direction until its destination processor is encountered, and 
assuming that messages are not buffered, the communications 
time per simulation pass is 


$$com\_t_{pass}^{ring} = \mathrm{MAX}_{PEs}\{\,(src\ msgs_{PE} + dest\ msgs_{PE} + 2 \times via\ msgs_{PE}) \times t_{com} + 1 \times t_{eval}\,\} \tag{5}$$

and the total time per pass is again the greater of this value and 
the largest of the individual evaluation times. The assumption 
that messages going through intermediary processors consume 
two message time units is rather pessimistic, but is made to 
maintain consistency with the other models. If there are N pro- 
cessor elements, then 


$$message\ path_{max} = N - 1 \tag{6}$$

and

$$message\ path_{avg} = \frac{N}{2} \tag{7}$$


The average path assumes a random distribution of message 
traffic. For 4 processors, the longest path visits 3 processor ele- 
ments and the average path visits 2, and for 16 processors, the 
longest path is 15 and the average path is 8. 


Array Architecture 


A simple array architecture has up to four nearest- 
neighbor connections per processor element. In the model, there 
are frequently multiple paths for a message from a given source 
element to its destination, so two approaches were used for 
selecting paths under these circumstances. In the first approach, 
the route was selected randomly; and in the second, the link at 
each step with the lowest cumulative message traffic was always 
chosen. The relative worth of these approaches will be dis- 
cussed along with the rest of the results in the following section. 
The communications time per processor element for each simula- 
tion pass is the same as that for the ring and is given in Equa- 
tion 5. 


A square array of n elements will have a longest path
for interprocessor communications of

$$message\ path_{max} = 2 \times (\sqrt{n} - 1) \tag{8}$$

and an average path of

$$message\ path_{avg} = \frac{2\sqrt{n}}{3} \tag{9}$$


For the 4 processor case, the longest path is 2 and the average 
path is 1.33. For 16 processors, these values become 6 and 
2.67, respectively. 


Hypercube Architecture 


A hypercube architecture with eight processor elements
has dimensionality of $k = \log_2(n) = 3$. The most common
scheme for routing messages in hypercube architectures uses a
fixed routing scheme based on the difference between the bit
encoded destination processor identifier and the current location
of the message. However, to maintain comparable communications
schemes, the approach for message routing used in the
array model is also employed for hypercubes. The results are
roughly equivalent to fixed routing for the random routing case,
and better than fixed for the balanced case. Each processor
element in a hypercube of dimensionality k has k interprocessor
communications paths. A hypercube with n elements will have a
longest communications path of


$$message\ path_{max} = \log_2(n) \tag{10}$$

and, if k is the dimensionality, an average path of

$$message\ path_{avg} = \frac{k\,2^{k-1}}{2^k - 1} \tag{11}$$

For the 4 processor case, the cube collapses to an array. For the
16 processor case, the longest path is 4 and the average path is
2.13.
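The path lengths quoted for the ring, array and hypercube can be checked with a short sketch. It assumes the average-path expressions N/2 for the ring, 2√n/3 for the square array and k·2^(k−1)/(2^k − 1) for the hypercube; these forms reproduce the 4 and 16 processor values given in the text, and the function names are illustrative.

```python
import math


def ring_paths(n):
    """Longest and average message path on a unidirectional ring of n elements."""
    return n - 1, n / 2


def array_paths(n):
    """Longest and average message path on a square nearest-neighbor array."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return 2 * (side - 1), 2 * side / 3


def hypercube_paths(n):
    """Longest and average message path on a hypercube of n = 2**k elements."""
    k = n.bit_length() - 1
    if n < 2 or 2 ** k != n:
        raise ValueError("n must be a power of two greater than one")
    return k, k * 2 ** (k - 1) / (2 ** k - 1)


for n in (4, 16):
    print(n, ring_paths(n), array_paths(n), hypercube_paths(n))
```

For 4 processors the array and hypercube functions return the same pair, matching the remark that the cube collapses to an array in that case.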


Crossbar Architecture 


A crossbar switch architecture contains point to point 
connections from each processor to every other processor. In 
this case, the communications time consumed per processor in 
each simulation pass is given by 


$$com\_t_{pass}^{crossbar} = \mathrm{MAX}_{PEs}\{\,(src\ msgs_{PE} + dest\ msgs_{PE}) \times t_{com} + 1 \times t_{eval}\,\} \tag{12}$$
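The per-pass communications times of the different models can be compared with a small sketch. It assumes the forms suggested by the text: the single bus serializes every message, while the ring, array and hypercube charge each processor one message time per message it sends or receives and two per message it forwards, and the crossbar has no forwarded messages. The function names and the message counts are hypothetical.

```python
def bus_com_time(msgs_per_pe, t_com, t_eval):
    """Single bus: all messages share one channel, plus one final evaluation."""
    return sum(msgs_per_pe) * t_com + t_eval


def routed_com_time(src, dest, via, t_com, t_eval):
    """Ring, array or hypercube: the busiest processor sets the time.

    src, dest and via hold, per processor, the messages it originates,
    receives and forwards; a forwarded message costs two message times.
    """
    return max((s + d + 2 * v) * t_com + t_eval
               for s, d, v in zip(src, dest, via))


def crossbar_com_time(src, dest, t_com, t_eval):
    """Crossbar: direct links everywhere, so nothing is forwarded."""
    return routed_com_time(src, dest, [0] * len(src), t_com, t_eval)


# Hypothetical message counts for one pass on four processors.
src = [2, 0, 1, 1]
dest = [1, 1, 1, 1]
via = [0, 3, 0, 1]
print(bus_com_time(src, 1.0, 1.0))
print(routed_com_time(src, dest, via, 1.0, 1.0))
print(crossbar_com_time(src, dest, 1.0, 1.0))
```

In this toy case the forwarding load on one processor makes the routed model slower than the bus, which illustrates why the pessimistic two-unit forwarding cost matters for the ring.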


RESULTS AND ANALYSIS 


This section presents results for parallel logic simulation
on the five models described above. Eleven test cases were run
for each model using three partitioning schemes and four
different assumptions about the relative cost of interprocessor
message transfers and device evaluations. The first circuit was a
simple ALU bit-slice; the rest were obtained from the 1985
International Symposium on Circuits and Systems.


The four evaluation to communication cost ratios used 
were 1 to 0, 3 to 1, 1 to 1, and 1 to 3. A ratio of 3 to 1 implies 
that a single evaluation is assumed to require three times the 
time required to complete a single message transmission between 
two processors. The 1 to 0 ratio is intended to model an ideal 
situation, i.e., a system that can transmit messages
instantaneously.


Table 1 lists average ideal concurrency figures for 4 
and 16 processor systems using all three partitioning schemes. 
Each entry indicates the attainable speed-up using the applicable 
partitioning approach and processor count if messages could be 
sent in zero time. As such, these data address how well the 
three partitioning schemes evenly distribute evaluation work 
among processors. A concurrency figure equal to the number of 
processors could occur only if the same number of evaluations 
are performed on each processor during every simulation pass. 
The average figure of 3.471 for random partitioning into 4 
blocks represents a peak processor usage efficiency of 87%. For 
the 16 processor case, the efficiency drops to 66% for random 
partitioning. 

Although the random partitioning exhibits clearly 
superior evaluation concurrency behavior, its appeal diminishes 
greatly when message transmissions are weighed into the figures. 
Tables 2 and 3 show the performance improvement of the other 
two partitioning schemes as the message transmission time is 
weighed in more heavily. In qualitative terms, it is not surpris- 
ing that cone partitioning schemes generate fewer interprocessor 
messages; their entire goal is to group signals with the devices 
they drive. 

The results also indicate that, even for as few as 4 
processors, it is quite possible to actually slow down a logic 
simulator by implementing it in parallel. Regardless of how the 
circuit is partitioned, if an evaluation takes one third the time of 
a message transfer, parallel simulation on the bus or ring archi- 
tectures will on average result in a speed-up of less than unity 
(i.e., an overall decrease in performance). For a "hardwired" 
simulation engine with evaluation routines based on high speed 
table lookups, ratios of this order are not at all unlikely. For 
output cone partitioning, if message transfers are three times as 
fast as evaluations, the resultant speed-up of approximately 3 
implies that 75% of the ideal result is achieved. 


The crossbar interconnection results shown in Table 2 
illustrate the attraction of the architecture where feasible. How- 
ever, if interconnect usage efficiency is considered, then the 
results are less impressive. A 50% increase in processor links 
over the array architecture yields average performance improve- 
ments of less than 20% in all cases. 


Table 3 presents results for the test cases executed 
using 16 processor models. For the bus and ring architectures 
with a 1 to 3 ratio, concurrencies of less than one are produced.
The array and hypercube results show that for higher processor 
counts, the message routing scheme has a greater influence on 
performance than was the case for 4 processors. 


It is interesting to note that, for the case in which 
message transfers are assumed to consume three evaluation 
times, only 71% of the test runs resulted in a performance 
increase over the single processor case. Also worthy of mention 
is the modest size of performance increases from the 4 to 16 
processor systems for each of the architectures. Four times as 
many processors increased the average performance of the single 
bus architecture by at most 80%. For the ring, the larger system 
improved the speed-up by only 62%. A fourfold increase in pro-
cessor elements brought about up to a 162% increase in overall
speed for the array. The hypercube achieved a 190% overall
speed-up for the 16 processor system relative to the smaller
configuration. Finally, the crossbar managed a 199% increase in
going from 4 to 16 processors. Of course, these figures are all 
the more disappointing in light of the significant increase in 
interconnect hardware which accompanies the expansion of these 
systems (excepting the single bus system) from 4 to 16 proces- 
sors. 


CONCLUSIONS 


We have compiled detailed modeling data for event 
driven parallel logic simulation on five architectures varying both 
circuit partitioning schemes and processor and interconnect per- 
formance. The results indicate that performance is extremely 
sensitive to both the partitioning scheme and the interprocessor 
communications speed. These two factors are obviously related: 
when the average device evaluation time dominates message 
transmission time, random partitioning produces the best results. 
But, as the relative cost of message-transfers rises, the two cone 
based partitioning schemes which seek to minimize message 
traffic surpass random partitioning. This seems to imply that 
simulation algorithms such as fault simulation and high level 
simulation that can exploit this relationship show significant 
potential for parallel applications. 


At the same time, the results are not very
encouraging. In order for the 16 processor crossbar architecture 
to gain a factor of 10 performance advantage over a single pro- 
cessor, it was necessary to add 15 processors and 120 intercon- 
nection links. If a more likely cost ratio of unity is used, the 
speed-up is halved. Obviously, these results do not bode well 
for parallel logic simulation of gate level design representations. 
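
The link counts quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes the crossbar provides one dedicated link per processor pair, the array is a two-dimensional mesh without wraparound, the ring has one link per processor, and the hypercube is binary. The function names are ours, and the simulated architectures may have differed in detail.

```python
def crossbar_links(n):
    """Assumed model: one dedicated link per processor pair, n(n-1)/2."""
    return n * (n - 1) // 2


def ring_links(n):
    """Assumed model: each processor linked to its successor (n >= 3)."""
    return n


def mesh_links(rows, cols):
    """Assumed model: 2-D array without wraparound connections."""
    return rows * (cols - 1) + cols * (rows - 1)


def hypercube_links(n):
    """Binary hypercube on n = 2**d processors has n*d/2 links."""
    d = n.bit_length() - 1
    if 1 << d != n:
        raise ValueError("hypercube size must be a power of two")
    return n * d // 2


for n, side in ((4, 2), (16, 4)):
    print(n, "processors:",
          "ring", ring_links(n),
          "array", mesh_links(side, side),
          "hypercube", hypercube_links(n),
          "crossbar", crossbar_links(n))
```

Under these assumptions the 16 processor crossbar needs 120 links, and the 4 processor crossbar needs 6 links against 4 for the 2 x 2 array, which matches the 50% increase mentioned earlier.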


This work has led us to focus on a search for simula- 
tion algorithms better suited for parallel processing. For exam- 
ple, if simulation were carried out at a higher level so that the 
time cost of evaluations could be substantially increased relative 
to communications costs, results much closer to the ideal con- 
currency figures of Table 2 might be possible. Other areas of 
interest in ongoing work include configuration heuristics for 
parallel simulation. Which partitioning scheme is most appropri- 
ate for a given circuit? What number of processor elements will 
yield the briefest simulation time? 
Abstract


This paper describes a parallel computation model based on a 
data/control flow notation which consists of separate but related 
sub-models of data path and control. The data path is formulated 
as a directed graph. The control structure, on the other hand, is 
modelled as a Petri net. This model is used for specification and 
synthesis of digital hardware with a high degree of concurrency and 
parallelism. The semantics of the proposed model is defined in 
terms of its interactions with the environment. That is, two pieces 
of hardware are considered to be semantically equivalent if they 
interact with an environment in the same way. This allows 
manipulation of the internal structure of the hardware to improve
performance as well as reduce cost. A set of transformations for the 
model which preserve its semantics is presented. A sequence of such 
transformations can be used to move a design from an abstract 
description to a final implementation. 


1. Introduction 


One approach to the design of complex digital hardware for VLSI
implementation is to use a top-down synthesis technique. A synthesis
approach starts the design with an abstract specification and refines
it step by step towards a physical implementation by adding details
[6]. Automated synthesis of parallel systems requires a parallel
computation model to support the description of the system being
designed. Such a computation model must be able to express the
existence of multiple hardware resources for data storage,
computation, and communication. At the same time, it must be
able to represent the existence of multiple control flows and
synchronization schemes.


This paper describes a parallel computation model in which a data 
path is used to represent the available hardware resources for data 
manipulation. The organization of this set of hardware resources to 
perform the prescribed computation is defined as a control structure 
which specifies the partial ordering of the given operations. Those 
operations which are not ordered, i.e., do not depend on each
other, can be carried out in parallel by physically distributed 
hardware resources. The control structure is formulated as a Petri 
net in the proposed model. 


One important task of a hardware synthesis process is to perform 
design optimization. As such, there must be as much freedom as 
possible to alter parts of the control as well as data path in ways 
that do not change the behavior of the given system. For this to be
possible, we must be able to characterize the behaviors of a system
and define precisely the concept of equivalent systems. The
semantics of the proposed model is defined in terms of its
interactions with the environment. That is, only the external events
are relevant to the semantics of the system. In this way, the
internal structure of the digital system can be changed without
changing its semantics. The system's interaction with its
environment is in turn defined based on two factors. First, the
functional relationship between each output variable and its
relevant input variables must be the same; secondly, the temporal
relationship between input/output operations should not be
different. This definition differs from other approaches which
consider only the input/output functional relation in terms of the
values being exchanged between a system and its environment.


Most other parallel system models have concentrated only on the
synchronization aspect, or the partial ordering of communications,
of parallel systems [5], [2]. For example, a Petri net can be used
to represent an event/condition system in which a partial ordering of
the occurrence of events is specified but the content of the events is
ignored [5]. CCS (a Calculus for Communicating Systems) defined
by Milner [2], on the other hand, models the occurrence of
potentially concurrent events as a shuffle (interleaving) of those
events; i.e., the events can occur in either order. As such, it suffers
from the composition explosion problem: when several agents are
composed together, the number of possible behaviors grows
exponentially with the number of agents. Consequently the
complexity of the behavioral expressions also increases
exponentially. Further, the computational aspects are
abstracted away in CCS. Our model, on the other hand, models both
the computations and their synchronizations, which are necessary
for synthesis of hardware systems.


Another description model for hardware synthesis which also uses
external events to characterize the semantics of a system has been
proposed by McFarland [1]. However, it uses regular expressions to
formulate the event structures. Consequently it is difficult to deal
with concurrent event structures. We are more interested in 
synthesis of algorithms (finally implemented as hardware) which are 
expressed as partially ordered events. 


2. Definition of the Computation Model 


The proposed computation model is based on the concepts of data
flow and control flow. The data flow part is modelled as a data
path, which represents the existence of multiple hardware resources
to perform different operations. The control flow, on the other
hand, dictates the partial ordering of these operations. In a parallel
computation, there exists more than one control signal stream;
the streams proceed at their own paces and synchronize with each
other only when necessary. The partial ordering relationship
between different sets of operations is modelled by a Petri net
notation.


Definition 2.1: A data path, D, over an algebraic structure is a
five-tuple, $D = (V, I, O, A, B)$,

where $V = \{V_1, V_2, \ldots, V_n\}$ is a finite set of vertices each of which
represents a data manipulation node;

$I = I(V_1) \cup I(V_2) \cup \cdots \cup I(V_n)$ with $I(V_j)$ = the set of input
ports associated with vertex $V_j$;

$O = O(V_1) \cup O(V_2) \cup \cdots \cup O(V_n)$ with $O(V_j)$ = the set of
output ports associated with vertex $V_j$;

$P = I \cup O$ is the set of ports; it is assumed that $I \cap O = \emptyset$.

$A \subseteq O \times I = \{(O, I) \mid O \in O(V_i), I \in I(V_j), i, j = 1, 2, \ldots, n\}$ is
a finite set of arcs each of which represents a connection from an
output port of a vertex to an input port of another vertex or the
same vertex;

$B : O \to OP$ is a mapping from output ports to operations.
$OP = \{OP_1, OP_2, \ldots, OP_m\}$ is a set of operations which define
the functional relation between an output port of a vertex and
its input ports. The set of operations is divided into the
sequential set SEQ and the combinatorial set COM.


Intuitively, a data path is a directed graph with each node having 
possibly multiple input ports and output ports. The nodes are used 
to model data manipulation units, for example data storage elements,
arithmetic operators, or communication channels. The arcs are used 
to model the connections of these data manipulation units. 


Therefore, the above definition is concerned mainly with the 
structure rather than the function of the data path. How the data 
path is used to perform computation is not explicitly defined. We 
assume that there exists an implicit interpretation of the underlying 
algebraic structure which supports the computation rules. Such an 
algebraic structure should consist of a domain of values for 
constants and variables, an assignment of values to the constants 
and a function definition for each operator. This algebraic structure 
is not considered here as it does not directly affect the basic 
formulation of the model. Further, to define the semantics of the 
system independent of any particular interpretation makes it 
possible to cope with different implementation environments. 
However, we assume that some modules exist in a module library 
which can perform the defined operations of the data path. 


The notion of ports here is used as a basic abstraction of the 
input/output behavior of a data manipulation unit and thus 
separates the implementation of the operation associated with the 
vertices from the specification. The operations of the vertices are
defined only by the relation between the output ports and the input 
ports. It is assumed that the output port will present a value which 
has the given relationship with the values present in the input 
ports. 
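
As an illustration of Definition 2.1, the data path can be sketched as a small data structure. This is only a sketch under our own assumptions: the class name, field names, port names, and operation names ("ADD", "REG") are hypothetical, and ports are represented by plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class DataPath:
    """Sketch of D = (V, I, O, A, B); all field names are illustrative."""
    inputs: dict    # vertex name -> set of its input ports, I(V)
    outputs: dict   # vertex name -> set of its output ports, O(V)
    arcs: set       # A: set of (output port, input port) pairs
    ops: dict       # B: output port -> operation name
    sequential: set = field(default_factory=set)  # SEQ; other operations are COM

    def vertices(self):
        return set(self.inputs) | set(self.outputs)

    def input_ports(self):
        return set().union(*self.inputs.values())

    def output_ports(self):
        return set().union(*self.outputs.values())

    def validate(self):
        """Check the structural conditions stated in the definition."""
        all_in, all_out = self.input_ports(), self.output_ports()
        if all_in & all_out:
            raise ValueError("I and O must be disjoint")
        for out_port, in_port in self.arcs:
            if out_port not in all_out or in_port not in all_in:
                raise ValueError(f"arc {(out_port, in_port)} is not in O x I")
        unmapped = all_out - set(self.ops)
        if unmapped:
            raise ValueError(f"no operation for output ports {sorted(unmapped)}")


# Hypothetical example: an adder V1 feeding a register V2.
adder_register = DataPath(
    inputs={"V1": {"Pi1", "Pi2"}, "V2": {"Pi3"}},
    outputs={"V1": {"Po1"}, "V2": {"Po2"}},
    arcs={("Po1", "Pi3")},
    ops={"Po1": "ADD", "Po2": "REG"},
    sequential={"REG"},
)
adder_register.validate()
```

The structure records only connectivity and the port-to-operation mapping; as in the text, how an operation computes its value is left to an external interpretation.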


Definition 2.2: A data/control flow system, $\Gamma$, is a seven-tuple,
$\Gamma = (D, S, T, F, C, G, M_0)$

where $D = (V, I, O, A, B)$ is a data path;

$S = \{S_1, S_2, \ldots, S_n\}$ is a finite set of S-elements, or control states
(places);

$T = \{T_1, T_2, \ldots, T_m\}$ is a finite set of T-elements, or transitions;

$F \subseteq (S \times T) \cup (T \times S)$ is a binary relation, the flow relation.

$C : S \to 2^A$ is a mapping from control states to sets of arcs of
the given data path; an arc $A_i$ is controlled by a control state $S_j$
if $A_i \in C(S_j)$.

$G : O \to 2^T$ is a mapping from output ports of data path
vertices to sets of transitions; a transition $T_i$ is guarded by
output port $O_j$ if $T_i \in G(O_j)$.

$M_0 : S \to \{0, 1\}$ is an initial marking function.


The definition of the data/control flow model is based on the 
marked Petri net notation. The Petri net S-elements are used to 
capture the control state concept. When a control state holds a 
token, a control signal will be generated to control the 
corresponding arcs in the data path specified by the control 
mapping function C. As there could be more than one control state 
which holds tokens, there exist multiple control signals in the 
systems. Further, the flow of these control signals (the temporal 
relation between signals) is defined by a partial ordering structure, 
which is captured by the flow relation F. To express the control 
flow being affected by the results of some internal computation, we 
must be able to use conditional signals (as results of some 
computation) to affect the control flow. For this purpose, the 
guarding condition concept is introduced into the Petri net 
notation; a transition may be guarded by a condition produced by
the data path and represented as an output port of some vertex.


Definition 2.3: For a data/control flow system $\Gamma = (D, S, T, F, C, G, M_0)$:

1. $X = S \cup T$ is the set of control structure elements.

2. $F^* = \bigcup \{F^n \mid n \in N^+\}$, where $F^0$ = identity and $F^n = F \circ F^{n-1}$
for $n \in N^+$, is the transitive closure of F.

3. $S_i \Rightarrow S_j$ iff $(S_i, S_j) \in F^*$; $\Leftarrow = (\Rightarrow)^{-1}$.

4. $\Leftrightarrow = \Rightarrow \cup \Leftarrow$. $S_i$ and $S_j$ are said to be in sequential order if
$S_i \Leftrightarrow S_j$.

5. $\parallel = (S \times S) \setminus \Leftrightarrow$. $S_i$ and $S_j$ are said to be in parallel order if
$S_i \parallel S_j$.
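
The relations of Definition 2.3 can be computed directly from the flow relation. The following is a minimal sketch, not the authors' implementation: the function names and the fork/join example net are our own, and, as a simplifying assumption, pairs of a control state with itself are left out of both relations.

```python
from itertools import product


def transitive_closure(flow):
    """Return the transitive closure of a set of (from, to) pairs."""
    successors = {}
    for a, b in flow:
        successors.setdefault(a, set()).add(b)
    closure = set()
    for start in successors:
        seen, stack = set(), list(successors[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        closure |= {(start, node) for node in seen}
    return closure


def order_relations(places, flow):
    """Split distinct place pairs into sequential and parallel order."""
    closure = transitive_closure(flow)
    pairs = {(a, b) for a, b in product(places, repeat=2) if a != b}
    sequential = {(a, b) for a, b in pairs
                  if (a, b) in closure or (b, a) in closure}
    return sequential, pairs - sequential


# Hypothetical fork/join net: S0 -> T0 -> {S1, S2} -> T1 -> S3.
flow = {("S0", "T0"), ("T0", "S1"), ("T0", "S2"),
        ("S1", "T1"), ("S2", "T1"), ("T1", "S3")}
sequential, parallel = order_relations({"S0", "S1", "S2", "S3"}, flow)
print(sorted(parallel))
```

In this example only S1 and S2 are in parallel order; every other pair of distinct control states is connected by a path through the flow relation and is therefore in sequential order.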


The data path consists of two kinds of elements, the nodes together 
with their ports representing the data manipulation units and the 
arcs representing the connection between those units. Each arc is 
controlled by, or said to be associated with, some control signals 
coming from the control Petri net. We can also associate the data 
manipulation units with the control signals by the following 
definition. 


Definition 2.4: $V_k$ is said to be associated with $S_j$ if
$\exists (O, I) \in A \, [\, (I \in I(V_k)) \wedge ((O, I) \in C(S_j)) \,]$.


By this definition, only the input ports of a vertex are significant 
for the associative relation. The output ports are irrelevant here 
because an output port can send data to more than one place at a 
time without resulting in conflicts. A single input port, on the other 
hand, cannot receive signals simultaneously from more than one 
resource. 


The set of vertices and arcs associated with a control state S forms 
a subgraph of the data path graph. This graph is called the 
associated graph of S. 


Definition 2.5: The arcs and vertices associated with control state
$S_i$, denoted by $ASS(S_i)$, are said to be active under $S_i$.


Intuitively, the arcs representing the data paths (e.g., a bus) are 
open, i.e., allow signals to pass, when their associated control signals
are on; the associated data manipulation units, on the other hand, 
will perform predefined operations. 


Before we go to the formal definition of the concepts of semantics 
and semantic equivalence, let us look at some simple examples. 
Under the above formulation, a simple adder with two input ports
and one output port can be modelled as a vertex $V_1$ with
$I(V_1) = \{P_{i1}, P_{i2}\}$, $O(V_1) = \{P_{o1}\}$. A register can be modelled as a vertex
$V_2$ with $I(V_2) = \{P_{i3}\}$ and $O(V_2) = \{P_{o2}\}$. A data path which
connects the output of the adder to the register can be modelled as
an arc $A_1 = (P_{o1}, P_{i3})$, which states that the output port of the $V_1$
component is connected to the input port of $V_2$.


If the output of the adder is only fed into the register when control
state $S_1$ is on, then $A_1 \in C(S_1)$ and $\{V_2, A_1\} \subseteq ASS(S_1)$. Note
that $V_1$ need not necessarily be associated with $S_1$; if, for example,
the adder has a local accumulator, a series of additions can be
performed and finally the sum be fed into register $V_2$ when $S_1$ is on.
When the sum is being sent to $V_2$, $V_1$ can continue with another
addition associated with, e.g., $S_2$ without conflict.
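
The adder/register example can be replayed in a few lines. This sketch assumes a plain dictionary encoding of the input-port sets and of the control mapping C; the function name `associated` and the string port names are illustrative only.

```python
def associated(state, control, input_ports):
    """Return ASS(state) as (vertices, arcs), following Definition 2.4.

    A vertex is associated with the state if one of its input ports is
    the destination of an arc controlled by that state.
    """
    arcs = set(control[state])
    vertices = {vertex for vertex, ports in input_ports.items()
                if any(in_port in ports for _, in_port in arcs)}
    return vertices, arcs


# Adder V1 (inputs Pi1, Pi2) and register V2 (input Pi3).
input_ports = {"V1": {"Pi1", "Pi2"}, "V2": {"Pi3"}}
# Control state S1 controls the arc A1 = (Po1, Pi3).
control = {"S1": {("Po1", "Pi3")}}

vertices, arcs = associated("S1", control, input_ports)
print(vertices, arcs)
```

Only the register is associated with S1, because the association looks at the input-port end of each controlled arc; the adder at the output-port end stays free for use under another control state.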


3. Semantics of the Model 


We now turn our attention to the definition of the semantics of the 
proposed computation model. The basic idea is that we can 
characterize the semantics of a system by the external events, i.e., 
its interactions with the outside world. An external event is either a 
read or write operation of the externally accessible ports. The 
semantics of a hardware system is defined as a set of events 
observed in its external ports. 


Before formally giving the definition of semantics of the
computation model, we have to define the behavior of the system,
which is in turn based on the execution rules of the control Petri
net and its interaction with the data path.


Definition 3.1: Given a data/control flow system $\Gamma = (D, S, T, F, C, G, M_0)$,
its behavior is defined as below:

1. A function $M : S \to N$ is called a marking of $\Gamma$ ($N = \{0, 1, 2, \ldots\}$).
A marking is an assignment of tokens to the S-elements.

2. Initially there is a token in each of the initial control states,
i.e., in each S-element $S_i$ such that $M_0(S_i) = 1$ as defined by
the initial marking $M_0$.

3. A transition T is enabled at a marking M iff for every S such
that $(S, T) \in F$, $M(S) \geq 1$; that is, all of the T-element's input
control states have at least one token.

4. A transition T may be fired when it is enabled and the guard
condition is true (i.e., the output port which guards T has a
TRUE value). If a transition has more than one guard
condition, an OR operation is applied to them; therefore, if
any guard condition is true, the transition's guard condition
as a whole is true.

5. Firing an enabled transition T removes a token from each of
its input control states and deposits a token in each of its
output control states.

6. If no token exists in any of the control states, the execution is
terminated.

7. $\mathcal{V}(P)$ is the data value present at port P.

8. When a control state, S, holds a token, its associated arcs in
the data path will open for data to flow; i.e., the data value
present at the input port, I, is equal to that of the corresponding
output port, O, which is denoted as $\mathcal{V}(I) \overset{S}{=} \mathcal{V}(O)$.

9. For every vertex V, $\mathcal{V}(O) := OP(\mathcal{V}(I(V)))$, where $OP \in B(O)$.
The assignment operator, :=, means that if OP is sequential
it takes the last defined value of the expression; otherwise it
takes the present value of the expression.

10. If none of the arcs entering an input port is active, its
value is undefined. If the operation of an output port is not a
sequential one and the output port depends on an undefined
input value, its value is also undefined.
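
Rules 3 to 6 above can be sketched as a small token-game interpreter. This is an illustrative sketch under stated assumptions, not the authors' simulator: markings and guards are plain dictionaries, port values are supplied by the caller rather than computed from the data path, and a transition with no guarding port is assumed to be always allowed to fire. All names are hypothetical.

```python
def preset(transition, flow):
    """Input control states of a transition."""
    return {s for s, t in flow if t == transition}


def postset(transition, flow):
    """Output control states of a transition."""
    return {s for t, s in flow if t == transition}


def enabled(transition, marking, flow):
    """Rule 3: every input control state holds at least one token."""
    return all(marking.get(s, 0) >= 1 for s in preset(transition, flow))


def guard_true(transition, guards, port_values):
    """Rule 4: OR over all guarding ports (unguarded: assumed true)."""
    ports = [p for p, ts in guards.items() if transition in ts]
    return not ports or any(port_values.get(p, False) for p in ports)


def fire(transition, marking, flow, guards, port_values):
    """Rule 5: move tokens from input to output control states."""
    if not (enabled(transition, marking, flow)
            and guard_true(transition, guards, port_values)):
        raise ValueError(f"{transition} cannot fire")
    new_marking = dict(marking)
    for s in preset(transition, flow):
        new_marking[s] -= 1
    for s in postset(transition, flow):
        new_marking[s] = new_marking.get(s, 0) + 1
    return new_marking


def terminated(marking):
    """Rule 6: no token in any control state."""
    return all(count == 0 for count in marking.values())


# Hypothetical net: S0 -> T0 -> {S1, S2} -> T1 -> S3, T1 guarded by port Pc.
flow = {("S0", "T0"), ("T0", "S1"), ("T0", "S2"),
        ("S1", "T1"), ("S2", "T1"), ("T1", "S3")}
guards = {"Pc": {"T1"}}
after_fork = fire("T0", {"S0": 1}, flow, guards, {})
after_join = fire("T1", after_fork, flow, guards, {"Pc": True})
print(after_fork, after_join)
```

After T0 fires, both S1 and S2 hold tokens, so two control signals are active at once; T1 can only join them once its guarding port carries a TRUE value.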


The possible existence of an undefined value and the intrinsic 
non-deterministic properties of the Petri net firing sequence 
together result in difficulties in determining the behavior of a 
system. We would like to exclude the nondeterministic properties 
by the following definition. 


Definition 3.2: A data/control flow system $\Gamma = (D, S, T, F, C, G, M_0)$
is properly designed if:

1. $ASS(S_i) \cap ASS(S_j) = \emptyset$, if $S_i \parallel S_j$.

2. There should not be more than one token appearing at the
same control state; that is, the Petri net must be safe.

3. If $(S, T_1) \in F$, $(S, T_2) \in F$, $T_1 \in G(P_{o1})$, and $T_2 \in G(P_{o2})$,
then $\mathcal{V}(P_{o1})$ AND $\mathcal{V}(P_{o2})$ = FALSE. That is, the Petri net
must be conflict-free.

4. The subgraph that belongs to a control state should not
include a combinatorial loop.

5. $\forall S_i \in S$, $ASS(S_i)$ must include at least one sequential
vertex.


This definition singles out those data/control flow systems which 
are safe, conflict-free and well-behaved. From now on we only 
consider properly designed systems. 
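
The first condition of Definition 3.2 lends itself to a simple static check. The sketch below assumes that the parallel-order pairs and the associated sets ASS(S) have already been computed and are passed in as plain Python sets and dictionaries; the function name and the example data are hypothetical.

```python
def shared_resources(parallel_pairs, ass):
    """Report parallel control states whose associated graphs overlap.

    Returns a dict mapping each offending (unordered) state pair to the
    set of data path elements the two states would use at the same time.
    """
    clashes = {}
    for si, sj in parallel_pairs:
        common = ass[si] & ass[sj]
        if common:
            clashes[tuple(sorted((si, sj)))] = common
    return clashes


# Hypothetical example: S1 and S2 both drive register V2 while in parallel.
ass = {
    "S1": {"V2", "A1"},
    "S2": {"V2", "A2"},
    "S3": {"V3", "A3"},
}
parallel_pairs = {("S1", "S2"), ("S2", "S1"), ("S1", "S3"), ("S3", "S1")}
clashes = shared_resources(parallel_pairs, ass)
print(clashes)
```

An empty result means that condition 1 holds for the given pairs; the remaining conditions (safeness, conflict-freeness, absence of combinatorial loops, and a sequential vertex per state) need separate analyses and are not covered by this sketch.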


Definition 3.3: For a data path $D = (V, I, O, A, B)$, there is a
set of external vertices, $V_e$, which only have either one single input
port (the set of output vertices, $V_o$) or one single output port (the
set of input vertices, $V_i$). The set of ports of the external vertices
$V_e$ are called external ports. The set of arcs, $A_e$, which connect to
the external ports, are called external arcs.


Definition 3.4: An external event is a pair $(A_i, v_i)$, with $A_i$ being an
external arc and $v_i$ a value passed over the arc. An external event is
controlled by, or labelled with, the Petri net control state that is
associated with the arc. That is, the external event happens at the
time when the associated control state has a token.


Definition 3.5: Given a data/control flow system $\Gamma = (D, S, T, F, C, G, M_0)$,
its external event structure is defined as $\mathcal{S}(\Gamma) = (E, \prec, \asymp)$
where

$E = \{E_1, E_2, \ldots, E_n\}$ is a set of external events;

$\prec \; \subseteq (E \times E)$ is a binary relation, the precedent relation. $E_i \prec E_j$
with $E_i = (A_i, v_i)$ and $E_j = (A_j, v_j)$, iff $E_i$ occurs before $E_j$
and $S_i \Rightarrow S_j$, where $A_i \in C(S_i)$ and $A_j \in C(S_j)$;

$\asymp \; \subseteq (E \times E)$ is a binary relation, the concurrent relation. $E_i \asymp E_j$
with $E_i = (A_i, v_i)$ and $E_j = (A_j, v_j)$, iff $E_i$ and $E_j$ occur at
the same time and $A_i \in C(S)$, $A_j \in C(S)$.


An external event structure specifies all the possible external events 
of a system as well as the temporal relationship between them. If 
two external events are in the precedent (concurrent) relation, they 
must always occur in the specified order (simultaneously). On the 
other hand, if two events are not in either of the two relations, they 
can occur in any order and are said to be in a casual relation. In a 
distributed system with a set of modules, for example, the temporal 
relations between some of the external events of two different 
modules can best be expressed as having a casual relation. Trying 
to force a total ordering on events of different modules will simply 
introduce unnecessary constraints and make it difficult to 
implement the system. 


In the above discussion, we assume that when an external event 
occurs whose operation is to obtain a value from the outside world, 
the environment will supply a value of the appropriate type to the 
system. We also assume that a sequence of such values is implicitly 
predefined for each input vertex, when an external event structure 
is specified. 


Definition 3.6: The semantics of a data/control flow system $\Gamma$,
denoted also by $\mathcal{S}(\Gamma)$, is defined by its external event structure.


4. Semantic Equivalence


Two systems are considered to be semantically equivalent if they 
behave identically with respect to the corresponding external ports; 
their internal behavior does not matter. 


Definition 4.1: Two data/control flow systems $\Gamma$ and $\Gamma'$ are
semantically equivalent, denoted by $\Gamma \equiv \Gamma'$, if $\mathcal{S}(\Gamma) = \mathcal{S}(\Gamma')$.


For the purpose of synthesis, however, the above semantic 
equivalence relation is still too weak. In general, it is undecidable 
whether two systems are equivalent to each other by this definition. 
It is very difficult, or simply impossible in some cases, to analyze a 
data/control flow system and obtain the complete external event 
structure as specified by definition 3.5. We have thus to introduce a 
stronger equivalence relation which requires every data dependence 
operation to be carried out in exactly the same order. This latter 
requirement is stronger than necessary. For example, two addition 
operations can be carried out in reverse order without changing the 
outcome of the computation. This strong definition, however, 
greatly reduces the complexity of the synthesis process and still 
provides enough room for the optimization algorithm to make large 
changes in the described system. 


Definition 4.2: The domain of a control state S, denoted as
dom(S), is defined as the set of vertices that have some output port
connected to an arc controlled by S. The codomain of S, denoted as
cod(S), is defined as the set of vertices which have some input port
connected to an arc controlled by S. The operations performed on a
control state S are the set of operations defined on the output ports
of its codomain. The subset of vertices of the codomain of S that
have some sequential output ports is called the result set of S
and denoted as R(S).


Definition 4.3: $S_i$ and $S_j$ are directly data dependent, denoted as
$S_i \leftrightarrow S_j$, if one of the following is true:

(a) $R(S_i) \cap dom(S_j) \neq \emptyset$.

(b) $R(S_j) \cap dom(S_i) \neq \emptyset$.

(c) $R(S_i) \cap R(S_j) \neq \emptyset$.

(d) $S_i$ and $S_j$ are in a control dependence relation; i.e., $M(S_i)$
depends on a subset of $R(S_j)$ or vice versa.

(e) $C(S_i)$ and $C(S_j)$ both contain some external arcs.


Definition 4.4: The transitive closure of $\leftrightarrow$, denoted by $\diamond$, i.e.,
$\diamond = \leftrightarrow^{*}$, is called a data dependence relation.


The data dependence relation is defined as the relationship between 
the operations which will contribute "data" to each other; in other
words, two operations are data dependent if they must be executed 
in the predefined order in order to retain the semantic integrity of 
the prescribed computation. Those sets of control signals which are 
not in a data dependence relation, however, can be arranged in any 
order without changing the semantics of the system. 
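
Definitions 4.3 and 4.4 can be sketched as follows. The code assumes that dom(S) and R(S) have already been extracted from the data path and are supplied as dictionaries of vertex sets; control dependence (condition (d)) is not derived but passed in as explicit pairs, and the set of control states that control external arcs is passed in for condition (e). All names and the example data are hypothetical.

```python
def directly_dependent(si, sj, dom, result, external):
    """Conditions (a), (b), (c) and (e) of Definition 4.3."""
    return bool(result[si] & dom[sj]
                or result[sj] & dom[si]
                or result[si] & result[sj]
                or (si in external and sj in external))


def data_dependence(states, dom, result,
                    external=frozenset(), control_pairs=frozenset()):
    """Transitive closure of direct data dependence (Definition 4.4)."""
    direct = {s: set() for s in states}
    for si in states:
        for sj in states:
            if si != sj and (directly_dependent(si, sj, dom, result, external)
                             or (si, sj) in control_pairs
                             or (sj, si) in control_pairs):
                direct[si].add(sj)
    closure = set()
    for start in states:
        seen, stack = set(), list(direct[start])
        while stack:
            s = stack.pop()
            if s not in seen:
                seen.add(s)
                stack.extend(direct[s])
        closure |= {(start, s) for s in seen}
    return closure


# Hypothetical chain: S1 writes R1, S2 reads R1 and writes R2,
# S3 reads R2; S4 touches unrelated registers.
dom = {"S1": {"IN"}, "S2": {"R1"}, "S3": {"R2"}, "S4": {"R9"}}
result = {"S1": {"R1"}, "S2": {"R2"}, "S3": {"R3"}, "S4": {"R8"}}
dependence = data_dependence(["S1", "S2", "S3", "S4"], dom, result)
print(sorted(dependence))
```

Here S1 and S3 are data dependent only through S2, which is what the transitive closure captures, while S4 is independent of the others and may be reordered freely with respect to them.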


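Definition 4.5: Given $\Gamma = (D, S, T, F, C, G, M_0)$ and $\Gamma' = (D, S, T', F', C, G, M_0)$,
$\Gamma$ and $\Gamma'$ are data-invariantly equivalent to
each other, iff

for every $S_i \Rightarrow S_j$ and $S_i \diamond S_j$ in $\Gamma$ ($S_i \in S$, $S_j \in S$),

we have $S_i \Rightarrow' S_j$ and $S_i \diamond' S_j$ in $\Gamma'$;
and vice versa. Here $\diamond$ denotes the data dependence relation of
definition 4.4.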

The above definition ensures that two operations are performed in 
parallel only if they are data independent and all of the data 
dependent operations in the two systems are performed exactly in 
the same order. Therefore the data-invariant equivalence relation 
satisfies the semantic equivalence relation. This means that we can 
reconstruct the control structure (without changing the data path) 
of a hardware system to improve system performance, for example, 
by carrying out as many operations in parallel as possible.


Theorem 4.1: The data-invariant equivalence relation satisfies the 
semantic equivalence relation. 


Proof 4.1: see the appendix.


Definition 4.6: Given $\Gamma = (D, S, T, F, C, G, M_0)$ with $D = (V, I, O, A, B)$
and $\Gamma' = (D', S, T, F, C, G', M_0)$ with $D' = (V', I', O', A', B')$,
$\Gamma$ and $\Gamma'$ are control-invariantly equivalent to
each other, iff $\Gamma'$ is the result of a vertex merger of $V_i$ into $V_j$ of $\Gamma$,
both $V_i$ and $V_j$ have the same operational definition and port
structure, and their associated control states are in sequential order.
The result of a vertex merger is defined as:

$V' = V - \{V_i\}$.

$I' = I - I(V_i)$.

$O' = O - O(V_i)$.

$A'$ is the same as A except that each $(O_i, I)$ with $O_i \in O(V_i)$ is
replaced by $(O_j, I)$ with $O_j \in O(V_j)$ and each $(O, I_i)$ with
$I_i \in I(V_i)$ is replaced by $(O, I_j)$ with $I_j \in I(V_j)$.

$G'$ is the same as G except that each $T \in G(O_i)$ is substituted
by $T \in G(O_j)$.


The intrinsic property of a merger operation is to share hardware 
resources by operations so as to improve the implementation in 
terms of cost. For example two addition operations can be 
implemented with the same adder by merging the two addition 
vertices together. By merging communication channels together we 
can also create structural components like buses in the
implementation. 


As a merger is only performed when the two vertices have their 
associated control states in sequential order, they will not attempt 
to use the vertex at the same time. As such the two sets of 
operations can share the same operator safely. Because the two 
vertices to be merged also have the same operational definition and 
port structure, the merger will not change the computational aspect 
of the given system. 
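
The merger of Definition 4.6 amounts to deleting one vertex and renaming its ports to those of the surviving vertex. The sketch below assumes that the caller supplies the port correspondence explicitly (`port_map`, a hypothetical parameter standing in for "the same port structure") and that the precondition on sequential order has been checked elsewhere. Since the control mapping C refers to arcs, its arcs would presumably need the same renaming; that step is omitted here.

```python
def merge_vertex(vi, vj, inputs, outputs, arcs, guards, port_map):
    """Merge vertex vi into vj; port_map sends each port of vi to a port of vj."""
    vj_ports = inputs.get(vj, set()) | outputs.get(vj, set())
    if not set(port_map.values()) <= vj_ports:
        raise ValueError("port_map must map onto ports of the surviving vertex")

    def rename(port):
        return port_map.get(port, port)

    # V' = V - {Vi}, I' = I - I(Vi), O' = O - O(Vi)
    new_inputs = {v: set(p) for v, p in inputs.items() if v != vi}
    new_outputs = {v: set(p) for v, p in outputs.items() if v != vi}
    # A': every arc end that touched a port of Vi now touches the port of Vj.
    new_arcs = {(rename(o), rename(i)) for o, i in arcs}
    # G': transitions guarded by a port of Vi are now guarded by the port of Vj.
    new_guards = {}
    for port, transitions in guards.items():
        new_guards.setdefault(rename(port), set()).update(transitions)
    return new_inputs, new_outputs, new_arcs, new_guards


# Hypothetical example: two adders V1 and V3; V3 is merged into V1.
inputs = {"V1": {"Pi1", "Pi2"}, "V3": {"Pi5", "Pi6"}}
outputs = {"V1": {"Po1"}, "V3": {"Po3"}}
arcs = {("Pa", "Pi1"), ("Pb", "Pi2"), ("Po1", "Pr1"),
        ("Pc", "Pi5"), ("Pd", "Pi6"), ("Po3", "Pr2")}
port_map = {"Pi5": "Pi1", "Pi6": "Pi2", "Po3": "Po1"}
merged = merge_vertex("V3", "V1", inputs, outputs, arcs, {}, port_map)
print(sorted(merged[2]))
```

After the merger each input port of the shared adder has two incoming arcs; this is safe only because the controlling states are in sequential order, so at most one of the arcs is open at any time.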


Theorem 4.2: The control-invariant equivalence relation satisfies 
the semantic equivalence relation. 


Proof 4.2: see the appendix.
5. Hardware Synthesis


This section discusses briefly the application of the proposed 
parallel computation model in a hardware synthesis environment. 
For a detailed description of the synthesis algorithms and 
comparisons to other related works, please see [3] and [4]. 


To synthesize hardware from some algorithmic description of its 
behavior, we first transform the description into the data/control 
flow notation. Based on such a formal description, some formal 
analysis techniques can first be used to check whether the systems 
are properly designed before the synthesis process starts [4]. 


The major part of the synthesis process is carried out by a sequence 
of control-invariant and data-invariant transformations as defined 
in the previous section. Since neither transformation changes
the semantics of the system, they can freely be applied to transform
a design to satisfy certain given criteria. For example, adding one 
more control flow path in the Petri net and possibly additional data 
manipulation units in the data path will allow more operation units 
to operate at the same time, thus increasing the parallelism of the 
computation. 


The synthesis algorithm starts with a preliminary design and 
transforms it step by step towards an optimal one. As there are
usually several ways to proceed from each step, it is necessary to have
some strategy to guide the transformation process. A critical path
analysis technique is used for this purpose. The set of trans- 
formation, analysis, and optimization algorithms has been designed 
and implemented in the CAMAD design aid system [3], [4]. 


6. Conclusions 


We have given the formal definition of a data/control flow model 
for parallel computation and its semantic equivalence notion. The
concept of semantic equivalence is defined based on two criteria. 
First the functional relationship between each output variable and 
its relevant input variables must be the same; secondly the 
temporal relationship between input/output operations should also 
be the same. 


Unlike other computation models used mainly for descriptive and 
analysis purposes, the proposed model addresses issues of design 
directly and allows graphical representations of the structures as 
well as behaviors of hardware systems. To apply this model to
hardware synthesis, we have introduced two basic transformations 
which change the internal structure of the hardware but keep the 
data dependency operations in the predefined order. The 
requirement that all data dependency operations be carried out in 
the predefined order is actually stronger than necessary. For 
example, two addition operations can be carried out in a reversed 
order without changing the outcome of the computation. This requirement,
however, greatly reduces the complexity of the synthesis process.
The use of such a formal computation model to represent the design 
of parallel hardware has led to the efficient use of CAD and 
automatic tools in the synthesis process. 


Appendix 


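Proof 4.1: Let $\Gamma = (D, S, T, F, C, G, M_0)$, $\Gamma' = (D, S, T', F', C, G, M_0)$,
and let $\Gamma$ and $\Gamma'$ be data-invariantly equivalent to each
other, i.e., for every $S_i \Rightarrow S_j$ and $S_i \diamond S_j$ in $\Gamma$ ($S_i \in S$, $S_j \in S$), we
have $S_i \Rightarrow' S_j$ and $S_i \diamond' S_j$ in $\Gamma'$; and vice versa, where $\diamond$ is the
data dependence relation of definition 4.4. We will show
that the external event structure of $\Gamma$ and that of $\Gamma'$ are the same.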


Suppose that a sequence of external events, $(A_i, v_{i1})$, $(A_i, v_{i2})$,
$(A_i, v_{i3})$, ..., are observed in arc $A_i$ which is associated with control state
S in system $\Gamma$. As the data path of system $\Gamma'$ is the same as that of
$\Gamma$, $A_i$ should also be present in $\Gamma'$ as an external arc and controlled
by S in $\Gamma'$.

For the values exchanged over $A_i$, we have two situations:

(1) If $A_i$ is connected to an input vertex, the function of the
external events is to input data from the environment. The
values passed over the arc are then provided by the
environment. As we assume that the sequence of such values
provided for each input vertex is fixed when we check the
semantic equivalence relation between different systems, the
same sequence of external events will be observed in system
$\Gamma'$.

(2) If $A_i$ is connected to an output vertex, the function of the
external events is to output data to the environment. The
values passed over $A_i$ are, therefore, determined by the
computation performed by the systems.


Let Ai = (O, I), and when M(S) = 1, an external event (Ai, vi) 
occurs with vi = V(O) (definition 3.4). If O ∈ O(V) and V is an 
input vertex (i.e., Ai connects an input vertex directly to an 
output vertex), V(O) depends again on the environment. 
Therefore, both systems exchange the same values at Ai. 


If O ∈ O(V) and V is not an input vertex, we have V(O) := 
OP(V(I(V))), where OP ∈ B(O); and V(Ii) <-Si- V(Oi) for each Ii 
∈ I(V) (definition 3.1). As V ∈ dom(S) and V ∈ R(Si) (we have 
assumed that V is a sequential vertex, without loss of 
generality), S is data dependent on Si and, therefore, Si => S. 
Since both systems have the same data path and Si => S in both 
situations, the values exchanged at Ai should be the same 
provided that each V(Oi) for the corresponding systems is the 
same. 


To show that V(Oi) is the same for Γ and Γ', we can use the 
same proof process as above. This recursive procedure will 
eventually reach the situation where V is an input vertex. At that 
point the same argument as in (1) can be applied again. 
Therefore, the same sequence of external events will be 
observed in Ai of both systems Γ and Γ'. 


From (1) and (2), it is clear that the sequence of external events 
observed at Ai of Γ' is exactly the same as that of Γ in any 
situation. As Ai can be any arbitrary external arc, this means that 
the sequence of external events which occur at every external arc is 
the same for both systems. 


As Γ and Γ' have the same number of corresponding external arcs, 
it follows from the above result that the complete sets of external 
events for both systems are the same. 


Next let us look at the partial relation between the external events 
of the two systems. Suppose that Ei < Ej with Ei = (Ai, vi) and Ej 
= (Aj, vj) in Γ. That is, Ei occurs before Ej and Si => Sj, where Ai 
∈ C(Si) and Aj ∈ C(Sj). By definition, we have Si =>' Sj in Γ', 
where Ai ∈ C(Si) and Aj ∈ C(Sj), because Si => Sj (thus Si =>' Sj). 


Assuming Ej occurs before Ei in Γ', then we must have Sj => Si in 
both Γ' and Γ. That is, Si and Sj are in a loop situation. 
Consequently, there exists a total ordering between the external 
events associated with these two control states, and it should be the 
same in both Γ and Γ'. Thus the assumption that Ej occurs before 
Ei in Γ' leads to a contradiction. Therefore Ei also occurs before 
Ej in Γ'. That is, Ei < Ej also holds in Γ'. 


Finally, we show that the concurrent relations of both systems are 
also the same as follows. 


If Ei = (Ai, vi) and Ej = (Aj, vj) occur at the same time and Ai ∈ 
C(S), Aj ∈ C(S) in Γ, then we should have Ai ∈ C(S), Aj ∈ C(S) 
in Γ', because the control mapping C is the same for both systems. 
Consequently, Ei and Ej should also occur at the same time in Γ' 
as they are associated with the same control state. Therefore, both 
systems have the same concurrent relation. 


Since both systems Γ and Γ' have the same external event set, the 
same precedence relation, and the same concurrent relation, S(Γ) = 
S(Γ'). That is, they are semantically equivalent to each other. 


Proof 4.2: Let Γ and Γ' be control-invariant equivalent to each 
other. That is, (a) Γ' results from a vertex merger of Vi into Vj 
of Γ, (b) both Vi and Vj have the same operational definition and 
port structure, and (c) their associated control states are in 
sequential order. 

Assume that the merger of Vi and Vj changes the semantics of the 
system. That is, S(Γ) ≠ S(Γ'), or (E, <, X) ≠ (E', <', X'). 
Because the control structures of both systems are the same, the 
temporal relationship between any two control states remains the 
same for both systems. 


As the number of arcs also remains the same after the merger 
operation and they are controlled by the same control states, the 
number of external events and their temporal relation remain the 
same for both systems. That is, the precedence relation and 
concurrent relation of both systems are the same. Therefore, the 
only possible difference between the two external event structures is 
that some of the external events have different values. 


For the external events that occur at an arc connected to an input 
vertex, the same argument as in Proof 4.1(1) can be used to prove that 
both Γ and Γ' have the same values passed in these external 
events. 


For the external arcs that are connected to an output port, let A = 
(O, I) with I ∈ I(Ve) and Ve ∈ Vo; { (A, vi1), (A, vi2), (A, vi3), ... } 
⊂ E and occur in the listed order in Γ; and { (A, vj1), (A, vj2), (A, 
vj3), ... } ⊂ E' and occur in the listed order in Γ'. 


Let also (A, vik) and (A, vjk) occur when M(S) = 1 in Γ and Γ' 
respectively. We have vik = V(O) in Γ and vjk = V(O) in Γ'. 


If O ∈ O(V) and V is an input vertex in Γ, we also have O ∈ O(V) 
and V as an input vertex in Γ'. Since in both cases V(O) depends 
on the environment, vik = vjk. 


If O ∈ O(V), V is not an input vertex, and V ≠ Vi, we have 
V(O) := OP(V(I(V))), where OP ∈ B(O) both in Γ and Γ'. By 
definition 3.1, V(Ii) <-Si- V(Oi) for each Ii ∈ I(V). As both systems 
have Si => S (see Proof 4.1), vik = vjk, provided that each V(Oi) for 
both systems is the same. 


If O ∈ O(V), V is not an input vertex, and V = Vi in Γ, we have 
V(O) := OP(V(I(Vi))), where OP ∈ B(O) in Γ and V(O) := 
OP(V(I(Vj))), where OP ∈ B(O) in Γ'. Since (a) both Vi and Vj 
have the same operational definition and port structure; (b) V(Ik) 
<-Si- V(Oi) for each Ik ∈ I(Vi) in Γ with V(Ik) <-Si- V(Oi) for each Ik ∈ 
I(Vj) in Γ'; and (c) both systems have Si => S; we have vik = vjk, 
provided that each V(Oi) for both systems is the same. 


To show that V(Oi) is the same for Γ and Γ' in the above two 
cases, we can use the same proof process again. The recursive 
procedure will eventually reach the situation where V is an input 
vertex; then the same argument as in Proof 4.1(1) can be 
applied. Therefore, the same sequence of external events will be 
observed in A of both Γ and Γ'. 


This result contradicts the assumption that some corresponding 
events of Γ and Γ' are different. That is, the assumption must be 
false. Therefore, S(Γ) = S(Γ'). 

Abstract 


This paper presents an asynchronous distributed approach for the 
simulation of behavior-level models representing complex digital and 
VLSI components on a parallel processor. The underlying architecture 
is a set of concurrent processors, such as a hypercube [1], that share 
data through explicit messages. The approach is implemented on 
the Bell Labs hypercube [2], which consists of 64 concurrent processors 
connected by a network of point-to-point communication channels in 
the plan of a binary 6-cube and provides a protocol-based operating 
system. A complex design is first partitioned, and the behavior-level 
models corresponding to the components of each partition are assigned 
to a processor. A model determines, based on the input signal 
transitions at the input ports, whether it may be scheduled for execution 
and, consequently, scheduling is distributed in the models. However, 
within each processor, only one behavior model may execute at any 
time instant. During execution of a behavior description, the signal 
transitions at an output port may be determined based on the signal 
values at all input ports defined up to t = t1 such that every input 
signal is defined up to t = t1. In addition, the assertion of a signal 
transition at an output port is deferred until the model description 
may determine with certainty that no future input signals may prove 
it inconsistent and require its deletion [3,4]. The behavior of digital 
and VLSI components, including complex timing, is expressed through 
the language constructs of C++ [5]. 


1 Introduction 

The discipline of synchronous distributed simulation of digital designs 
at the logic level on parallel processors has been addressed by the 
Yorktown Simulation Engine [6], IBM Los Gatos Logic Simulation 
Machine [7], and ZYCAD [8]. The subject of asynchronous distributed 
simulation with a focus on queuing networks has been addressed in the 
recent past by Misra [9], Chandy [10], Lamport [11] and Peacock [12]. 
The Daisy Megalogician [13] and ULTIMATE [14] machines address 
the issues of parallelizing a simulation algorithm. 


Behavior models of complex digital and VLSI devices are flexible 
and provide a competitive means of system simulation [15], and the 
results of simulation are more comprehensible to high-level architects 
than gate-level simulation results. Consequently, the importance 
of distributed simulation of such models on parallel processors is 
obvious. The difference between this approach and the one proposed 
by Misra [9] may be expressed as follows. 


Accurate representation of a component's behavior, including timing, 
in the models requires the representation of the unique high-to-low 
(tphl) and low-to-high (tplh) propagation delays for every component. 
For an input signal transition at t = t1, the "predictability" condition 
[9] would imply the generation of an output transition at t = t1 + tphl 
or t = t1 + tplh, depending on the nature of the transition, and its 
assertion at the output port. The predictability condition is an important 
aspect of the approach proposed by Misra [9]. Such an assertion may 
cause incorrect simulation results as an input signal transition at a 
future time t = t2 (t2 > t1) may, under certain circumstances, generate a 
new output transition that requires the previous output transition to 
be discarded [3,4]. The cause of such potentially unreliable simulation 
results may be attributed to the anticipatory semantics of the behavior 
description language and event driven simulation. In the approach 
presented in this paper, the behavior description first determines with 
certainty that an output transition may not be discarded and then 
asserts it at the output port. In contrast to Chandy's [10] proposal 
of simulating to a deadlock and then recovering from it, the approach 
presented here may be characterized by an absence of deadlocks. 


2 Asynchronous Distributed Simulation on Parallel 
Processors 
An asynchronous distributed approach for the simulation of behavior 
models on a special parallel processor architecture, the hypercube, is 
presented in this section. The potential advantage of this approach over 
conventional sequential simulation on a uniprocessor is higher speed. 
Execution of digital or VLSI hardware may be characterized by the 
exchange of signals between the component modules, where each signal 
is constituted by a sequence of signal transitions. A transition may be 
characterized by a logical value and an assertion time. In conventional 
simulation, the ordering of the signal transitions or events for correct 
results is achieved through a global entity, time, and centralized 
control. In this approach, the ordering is guaranteed by a sequence of 
messages between the models and their proper interpretation and usage 
by each of the behavior models. In this paper a message represents a 
signal transition. The overall philosophy may be expressed as follows. 
Each and every behavior model correctly interprets messages at the input 
ports, determines the output signal transition based on the input 
signals, and asserts only correct output assignments at the output port 
through messages. Consequently, for a given set of external signals 
at the primary input ports, correct simulation results are guaranteed. 
In addition, explicit identification of the clock lines is not required. 
As transitions corresponding to every signal, including clocks, may 
be expressed through messages, and the output is determined by the 
behavior description in a model solely based on the input transitions, 
synchronous and asynchronous designs, including self-timed designs, 
may be simulated in this approach. 

First a given digital or VLSI design is partitioned into 63 or fewer 
partitions corresponding to 63 processors, and processor 0 is dedicated 
to the task of asserting the external signal transitions at the primary 
input ports of the design. For a modest-size design with fewer than 63 
behavior models, each processor may be allocated a single model for 
simulation. 


The task of scheduling behavior models for execution is distributed 
among them, and a model schedules itself when it determines that the 
necessary conditions, described subsequently, have been satisfied at the 
input ports. Given n input ports I1, ..., In of a component C and signal 
transitions at the ports defined up to t = t1, ..., t = tn, respectively, the 
corresponding model may execute and determine the signal transition 
at the output port that is based on the input signals defined up to 
t = tmin, where tmin is the minimum of {t1, ..., tn}. Assuming a value 
"d" for the propagation delay of the component, the output signal 
transition may be defined at t = tmin + d but its assertion is deferred 
because of the possibility that a future input transition defined at 
t > tmin may cause an output transition that is inconsistent with the 
previously generated transition at t = tmin + d and require its deletion. 
An output transition that is defined at t < tmin and was generated 
during a previous execution of the model may not be affected 
by any future input transition defined at t > tmin, and the behavior 
model may, with certainty, assert the transition at the output port. This 
principle is referred to as the deferred assertion of output assignments. 
The issue of generation, detection, and deletion of inconsistent output 
assignments is detailed in [3,4] and is not presented here. 
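
The bookkeeping behind deferred assertion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, the paper's models are written in C++, and the detection and deletion of inconsistent outputs described in [3,4] is deliberately left out. The sketch records, per input port, the simulation time up to which the signal is defined, and releases a stored output transition only once its time lies strictly below the minimum of those times.

```python
class DeferredOutputs:
    """Hypothetical bookkeeping for one behavior model (illustration only)."""

    def __init__(self, num_inputs):
        # Simulation time up to which each input port is defined (None = nothing yet).
        self.defined_up_to = [None] * num_inputs
        # Output transitions computed but not yet asserted: (time, value).
        self.pending = []

    def receive(self, port, time):
        """Record that the signal at `port` is now defined up to `time`."""
        self.defined_up_to[port] = time

    def horizon(self):
        """Minimum of the per-port times, or None if some port is undefined."""
        if any(t is None for t in self.defined_up_to):
            return None
        return min(self.defined_up_to)

    def store(self, time, value):
        """Keep a computed output transition inside the model."""
        self.pending.append((time, value))

    def release(self):
        """Return the stored transitions that no future input can invalidate."""
        h = self.horizon()
        if h is None:
            return []
        safe = [p for p in self.pending if p[0] < h]
        self.pending = [p for p in self.pending if p[0] >= h]
        return safe
```

With both inputs defined up to t = 20, a stored transition at t = 25 stays inside the model; once both inputs are defined up to t = 30 it is released.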

Consider the simulation of the circuit shown in Figure 1. Although 
a simple circuit is chosen for simplicity of explanation, the distributed 
approach applies equally to complex behavior models. The output 
ports of components A and B are connected to the inputs of the 
two-input AND gate C, and the signal transitions generated by each of A 
and B between t = 0 and t = 30 are shown in Figure 1. Assuming 
models A and B are allocated arbitrarily to processors 2 and 3, A and 
B are executed asynchronously and, as a result, the real times during 
simulation at which the signal transitions are propagated from A to C 
and B to C may not relate to each other. The individual transitions 
from A to C (n1: 0 at t = 0, n2: 1 at t = 10, n3: 0 at t = 20, and n4: 
1 at t = 30, where t represents the simulation time) are guaranteed to 
be asserted in the order that is represented in logical time [11] as shown 
in Figure 2a. Figure 2b represents a similar ordering of the transitions 
from B to C (m1: 1 at t = 0, m2: 0 at t = 20, and m3: 1 at t = 
30) in logical time. For the purpose of explanation, assume that the 
ordering of the transitions in real time is represented by Figure 3. The 
correctness of the distributed approach is invariant to the ordering of 
the events in real time given that the logical ordering specified in each 
of Figures 2a and 2b is preserved. In Figure 3, assume m1, n1, n2, 
n3, m2, m3, and n4 are asserted at C at real times T = s1, T = s2, 
T = s3, T = s4, T = s5, T = s6, and T = s7 respectively, where T 
represents the progress of real time during simulation on the parallel 
processor. 


Corresponding to the assertion of the input transition m1 at T = s1, 
C is unable to schedule itself for execution as the signal transition at 
port 1 is yet to be specified for t = 0, where t represents the progress of 
simulation time and corresponds to the hardware execution. At T = 
s2, C schedules itself for execution and an output transition l1: 0 at 
t = 0 + 5 = 5 is determined. Given that the previous value of the 
output was 0, l1 does not imply any new information and is ignored. 
Corresponding to each of n2 and n3 at T = s3 and T = s4 respectively, 
the behavior description of C is not executed as the signal transition 
at port 2 has not been asserted beyond t = 0. At T = s5, signal 
transitions have been specified at t = 20 at both ports 1 and 2 and C 
is scheduled for execution. Output transitions l2: 1 at t = 10 + 16 = 
26 and l3: 0 at t = 20 + 5 = 25 are generated, but l2 is observed to 
be inconsistent with l3 and is consequently discarded. Assertion of the 
output transition l3: 0 at t = 25 is deferred as the transitions at the 
input ports of C are defined only up to t = 20 and the model is as yet 
unable to conclude with certainty that l3 may not be discarded in the 
future. At T = s6, transition m3 is asserted at an input port of C but the 
input signal at port 1 is as yet undefined beyond t = 20. Consequently, 
C may not be executed, nor may any decision regarding l3 be finalized. 
At T = s7, input signals at ports 1 and 2 are both defined at t = 30 and, 
given that l3 has not yet been shown inconsistent and 30 > 25, it may 
be asserted, with certainty, at the output port of C and consequently 
propagated to other components that are connected to the output 
port of C. In addition, the behavior description of C is executed and 
an output assignment l4: 1 at t = 30 + 16 = 46 is determined and 
stored within the model. 
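
The scheduling decisions in this walkthrough can be traced with a short Python sketch. It is an illustration only, with hypothetical names, and it tracks just when model C executes (whenever the minimum of the times up to which its two ports are defined advances); it does not compute output values or handle inconsistent transitions. Port 1 is assumed to carry the transitions from A and port 2 those from B, and the arrival list follows the real-time order assumed for Figure 3.

```python
def execution_points(arrivals, num_ports=2):
    """Return (arrival index, simulation time) pairs at which the model executes.

    arrivals: list of (port, simulation_time) in the real-time order of receipt.
    """
    defined = {port: None for port in range(1, num_ports + 1)}
    last_executed = None
    fired = []
    for step, (port, t) in enumerate(arrivals, start=1):
        defined[port] = t
        if None in defined.values():
            continue  # some port has no transition yet, so the model cannot run
        horizon = min(defined.values())
        if last_executed is None or horizon > last_executed:
            last_executed = horizon
            fired.append((step, horizon))
    return fired


# Real-time order s1..s7: m1, n1, n2, n3, m2, m3, n4
arrivals = [(2, 0), (1, 0), (1, 10), (1, 20), (2, 20), (2, 30), (1, 30)]
print(execution_points(arrivals))  # [(2, 0), (5, 20), (7, 30)]
```

The result matches the text: C executes at T = s2, T = s5, and T = s7, and is idle at the other arrivals.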


3 Blocking and Deadlock 


The number of active gates during gate-level simulation has generally 
been observed to be between 5% and 20%, and this may be assumed to hold 
true for behavior-level simulation. Such low activity may cause the 
following scenario during asynchronous distributed simulation. For 
example, in Figure 1 assume component A executes a number of times 
due to signal transitions at its input ports between t = 0ns and t = 
1000ns (say) and asserts a number of transitions at port 1 of C. Also 
assume B executes infrequently due to a limited set of transitions at 
its input port between t = 0ns and t = 1000ns with the consequence 
that only one transition at t = 2ns (say) is asserted at port 2 of C. The 
value of the signal at port 2 remains essentially unchanged between 
t = 2ns and t = 1000ns. Consequently, C may not execute beyond 
t = 2ns and this situation constitutes blocking [12] with the source 
being component B. In addition, other components, if any, that are 
connected to the output port of C either directly or indirectly will 
be blocked, implying a possibility of very low overall activity during 
simulation. 


Blocking does not correspond to a physical process in hardware 
execution and its cause may be explained as follows. Event driven 
simulation with selective trace requires, for efficiency, that only changes 
in the logical value of a signal be propagated. Consequently, the value of 
the signal between two consecutive transitions e1 and e2 is identical 
to the value indicated in e1 in a uniprocessor environment. Such an 
assumption is dangerous in distributed asynchronous simulation 
on a parallel processor as a message to the input port of a component 
may be delayed due to asynchrony and the behavior model may 
erroneously interpret the absence of a message to imply "no change" in 
the logical value at that port. Consequently, a component must execute 
based on signals at input ports at t = t1 such that transitions have 
been asserted at all input ports at t >= t1. Such a mechanism, as well as 
the principle of deferred assertion of output assignments, may increase 
the possibility of occurrence of blocking. 


In the event that blocking occurs during simulation of a design, it is 
first detected in the following manner. When the number of input 
assignments at an input port of a component that have not yet been used 
to generate output events exceeds a threshold, the component raises an 
exception. As a consequence of the exception, the execution mode of 
every processor is set to "exception-mode". The execution mode of the 
processors is reset from exception-mode when the cause of blocking is 
removed, i.e., the number of outstanding input assignments at the 
input port of the component falls below the threshold. The actual value 
of the threshold is empirically determined and it influences the 
relative durations of normal- and exception-modes during a simulation. 
The characteristics of the exception-mode may be expressed as follows. 
Signal values are asserted at all input ports of components including 
the primary input ports even when the logical values are unchanged 
from their previous values. In addition, when a model is executed at t 
= t1, either a previously generated correct signal transition that was 
not yet asserted at the output port is propagated to the output or the 
most recent logical value at the output is asserted at t = t1 plus the 
minimum of the high-to-low and low-to-high propagation delays of the 
component. 

Assume that the components A and B in Figure 1 are executed on 
processor I of a parallel processor system while model C is executed on 
another processor II of the system. Assume further that a significant 
number of signal transitions are asserted at the input ports of A and 
that the signals at the input ports of B are virtually unchanged in their 
logical values. Consequently, A is executed frequently and B is exe- 
cuted very infrequently and very few output transitions are asserted at 
the input port 2 of C. The model C is unable to execute in the absence 
of signal transitions at the input port 2 and the number of outstanding 
input entries at the input port 1 of C may exceed the threshold. 
Consequently, C raises an exception and the execution modes of all the 
processors are set to exception-mode. In this mode, signal transitions 
are asserted at the input ports of B even though they are unchanged 
in their logical values. Consequently, B is executed more frequently 
and a modest number of output transitions are asserted at port 
2 of C. The model C is executed, the outstanding entries at 
port 1 are utilized to generate output assignments, and the cause of 
blocking is removed. 
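
The threshold test that switches the processors into and out of exception-mode can be sketched as follows. This Python fragment is a hypothetical illustration, not the authors' implementation: the class name, the dictionary of per-port counts, and the single global flag standing in for "the execution mode of every processor" are all assumptions. Following the wording above, the mode is entered when a count exceeds the threshold and left only when every count has fallen below it.

```python
class BlockingMonitor:
    """Hypothetical global view of outstanding input assignments per port."""

    def __init__(self, threshold):
        self.threshold = threshold      # empirically chosen, per the text
        self.outstanding = {}           # (component, port) -> unused input count
        self.exception_mode = False     # stands in for the mode of all processors

    def update(self, component, port, count):
        """Record a new count for one input port and return the current mode."""
        self.outstanding[(component, port)] = count
        counts = self.outstanding.values()
        if any(c > self.threshold for c in counts):
            self.exception_mode = True      # some component raises an exception
        elif all(c < self.threshold for c in counts):
            self.exception_mode = False     # cause of blocking removed
        return self.exception_mode
```

A count exactly equal to the threshold leaves the mode unchanged in this sketch; the text does not say how that boundary case is treated.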

The possibility of deadlocks during asynchronous distributed simu- 
lation of designs with feedback loops and their resolution is addressed 
in the remainder of this section. Consider simulation of a simple latch 
shown in Figure 4. 

Assume the presence of signal transitions defined between t = 0ns 
and t = 1000ns at the input ports 1 and k of components A and B 
respectively. Neither A nor B may execute, as explained subsequently. 
A may schedule itself for execution when transitions are propagated 
to its input port R from the output of B following execution of B. 
However, B may not schedule itself for execution until A has executed 
and asserted transitions at its input ports. Consequently, a deadlock 
results. This paper presents an approach that ensures absence 
of deadlocks and implements the principle of deferred scheduling for 
correctness of the results. A somewhat similar approach has been 
proposed by Peacock [12]. 


Every component on a feedback arc is identified and the behavior 
descriptions corresponding to such components are modified to 
perform the following action. Given that tphl and tplh values are 
associated with every component, execution of a behavior model at t = 
t1 may generate an output transition at t = t1 + tplh or t = t1 + 
tphl depending on the nature of the transition. Where tplh < tphl and 
the assertion time of the output transition is given by t = t1 + tphl, 
the transition is stored within the body of the model and its assertion 
deferred until a later time. Instead, a timestamp with the assertion 
time given by t = t1 + tplh is generated and propagated through the 
output port as the logical value of the signal at the output port will, 
with certainty, remain unchanged up to t < t1 + tplh. Where the 
assertion time of the output transition is given by t = t1 + tplh, it may 
be asserted at the output port immediately as no future transitions 
at the input port beyond t = t1 may cause the output transition to 
be discarded. A limitation of this approach is that the efficiency of 
simulation of circuits with feedback loops may be low when the 
components constituting the feedback loop are distributed over 2 or more 
processors and the frequency of the signal at the non-feedback port is 
considerably lower as compared to the sum of the propagation delays 
of the components constituting the feedback loop. 
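
The rule for components on a feedback arc can be sketched as a small Python function. The function name, the message tuples, and the use of the shorter of the two delays (rather than tplh specifically) are assumptions made for illustration; the text states the rule only for the case tplh < tphl. A transition whose delay is the shorter one is sent at once, while a transition with the longer delay is stored and only a timestamp is sent.

```python
def emit_on_feedback_arc(t1, new_value, tphl, tplh):
    """Decide what a model on a feedback arc sends after executing at time t1.

    new_value: 1 for a low-to-high output transition, 0 for high-to-low.
    Returns (messages_sent_now, stored_transition_or_None).
    """
    delay = tplh if new_value == 1 else tphl
    shortest = min(tphl, tplh)
    assert_time = t1 + delay
    if delay == shortest:
        # No later input can invalidate this transition, so it is asserted now.
        return [("transition", assert_time, new_value)], None
    # Output is known to be unchanged before t1 + shortest: send a timestamp
    # and keep the real transition inside the model for later assertion.
    return [("timestamp", t1 + shortest, None)], (assert_time, new_value)
```

For example, with tplh = 5 and tphl = 16, a low-to-high transition computed at t1 = 10 is asserted immediately at t = 15, whereas a high-to-low transition is stored for t = 26 and a timestamp for t = 15 is propagated instead.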


4 Analysis of Performance of the Asynchronous 
Distributed Approach 


The asynchronous distributed simulation approach has been 
implemented on the Bell Labs hypercube [2], which consists of 64 concurrent 
processors and provides a protocol-based operating system. The 
behavior models of VLSI and digital components are described through 
the C++ [5] language constructs. 


In an experiment to estimate the performance of the asynchronous 
distributed approach, a typical example design, a two-bit adder, is 
considered, where the individual gates are replaced by models whose 
execution times may be parametrically controlled. The model execution 
times are varied from 0.34ms through 3.4ms, 34ms, 170ms, and 340ms 
to 3.4 sec and are based on estimates of model sizes of AM2903, Intel 
8086, Motorola 6809, and the VHDL benchmarks. First, in the 
experiment the entire design is simulated on a single processor. Then, 
the circuit is partitioned into two, four, eight, and sixteen parts and 
simulated with 2, 4, 8, and 16 processors. For each case, performance 
data is collected by varying the number of input vectors from 100 to 
1000 and the model sizes from 0.34ms to 3.4 sec. 


The graphs in Figures 5a, 5b, and 5c present a logarithmic plot of 
the CPU time versus the input vector size for varying model sizes for 
the cases of 1, 4, and 8 processors. It may be observed from the graphs 
that the performance of the algorithm is linear. The graphs in Figure 
6 present a logarithmic plot of the CPU time versus the model size for 
varying input vector sizes for a four processor simulation. The knees of 
the individual plots corresponding to the model size of 0.34ms reflect 
the dominance of message communication in the hypercube over model 
computation for model sizes smaller than 0.34ms and the dominance 
of the model computation over communication for model sizes larger 
than 0.34ms. Figure 7 presents a plot of the speedup factor versus the 
number of processors for three specific pairs of model and vector sizes. 
The graph corresponding to the model size of 0.34ms and vector size 
100 resembles a saturation curve and reflects the dominance of the 
message communication over model computation in the hypercube. 
The other two graphs are both linear, indicating that the speedup 
factor increases linearly with increasing number of processors and, 
consequently, the performance of the proposed approach is linear. The 
maximum speedup factor for the example is observed to be 12 when 
the design is partitioned and simulated with 16 processors. The slope 
differences of the graphs also indicate that increasing CPU time is 
spent in model computation as opposed to communication and other 
overhead for increasing model sizes. 
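
As a small worked figure, the reported maximum speedup can be turned into a parallel efficiency (speedup divided by processor count). The helper below is a hypothetical illustration in Python; only the values 12 and 16 come from the text.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


# Reported maximum: speedup of 12 on 16 processors.
print(parallel_efficiency(12, 16))  # 0.75
```

That is, the best case reported uses the 16 processors at 75% of ideal linear speedup.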


5 Conclusions 


This paper has presented a distributed asynchronous approach for the 
simulation of behavior models on parallel processors. In this approach, 
a model determines, based on the input signal transitions at the input 
ports, whether it may be scheduled for computation and, consequently, 
scheduling is distributed in all the models. In addition, the principle 
of deferred scheduling ensures that inconsistent output events are 
detected and deleted, with the consequence that correct signals are 
generated. The approach guarantees the absence of deadlocks and resolves 
blocking by temporarily forcing the execution mode of all processors 
to exception-mode, wherein the cause of blocking is removed. The 
approach has been implemented on the Bell Labs hypercube and the data 
obtained from the simulation of designs indicate that the performance 
of the approach is linear. 

Figure 1: AN EXAMPLE DESIGN. 

Figure 2: LOGICAL ORDERING OF EVENTS IN AN ASYNCHRONOUS 
DISTRIBUTED SIMULATION. 

Figure 3: REAL TIME ORDERING OF EVENTS IN 
AN ASYNCHRONOUS DISTRIBUTED SIMULATION. 

Figure 4: SIMULATION OF A CIRCUIT 
WITH A FEEDBACK LOOP. 


[Three plots of log(CPU time) versus log(vector size), for vector sizes 
from 100 to 1000, with one curve per model size (MS).] 

Figure 5: GRAPHS OF CPU TIME VS. VECTOR SIZE. 
(a) Uniprocessor (b) Four processors (c) Eight processors 

Hyper-rectangulars are a generalization of m-ary d-cube networks 
(arbitrary radix hypercubes), where the width of the network can be 
different in each dimension. This gives them configuration flexibility 
advantages over their single radix constrained subset. 
Hyper-rectangulars are studied in four classes of configurations, one with 
all nodes in any one dimensional line in the graph connected to the 
same channel (bus), and the others with adjacent nodes connected 
with dedicated links. The dedicated links may be unidirectional or 
bidirectional, and the nodes can be connected linearly or as a toroid. 
Given a uniform message rate from each node and uniform targeting 
to each node, simple formulae are derived for message traffic in 
all cases. Simplicity of the formulae for most cases does not suffer 
from the generalization to hyper-rectangular topology, and the 
results are more broadly applicable. Beyond the operational 
analysis, stochastic assumptions about the message rates are used to 
compute overall message latencies and queue lengths within the 
system. The results have been verified by simulation. 


1. Introduction 


Interconnection networks are important in the design of computer 
and communication systems and have been studied in great detail 
over many decades. Prior to the rise of computers during the last 
few decades, most of these studies centered on the telecommunica- 
tions domain, with its large conglomeration of terminal equipment 
and intermediate switching stations. Thus much of the early work 
in this area focused on networks comprised of two node types— 
terminal nodes and intermediate nodes. Messages originated at a 
terminal node (the source node) and were routed to another termi- 
nal node (the destination node) via the intermediate nodes. With 
the rise of computers, and particularly with the contemporary focus 
on parallel computing, more attention has been given to networks 
that simply interconnect computing nodes without making any dis- 
tinction between terminal nodes and intermediate nodes. 


A computer interconnection network has an abstract representation 
as a graph whose nodes are switching points and whose arcs are 
communication links. Messages originate at a source node and pass 
along one or more links to the destination node. If more than one 
link is employed along a path between the source and destination 
nodes, then intermediate nodes perform routing functions for the 
messages along the way. If every node in the network can both 
originate and absorb messages, as well as serve as intermediate 
nodes, then we say the network is static, whereas if there are some 
nodes that may only serve as intermediaries (i.e. for routing) then 
we say the network is dynamic [9]. 


In the following we discuss a general class of static networks which 
we call hyper-rectangulars; these are a direct generalization of the 
more common hypercube networks. 


Consider a connected graph of $m^d$ nodes with the following properties:
(i) each node is designated by a d-digit radix-m number,
and (ii) there exists an arc between any two nodes whose numbers
differ by one in exactly one digit position, and which are equal in
all other digit positions. With these properties a network is called
an m-ary d-cube, or hypercube. If m = 2 then it is an example of
a binary hypercube, of which there are a number of commercial
examples [2]. If d = 2 or d = 3 then it is typically called a two- or
a three-dimensional mesh structure.


Hypercubes are interesting because of the simple routings which are 
possible (see below) and because of the range of simpler networks 
which can be topologically mapped onto them [8]. For a given
number of nodes, $M = m^d$, the optimal value for d, the hypercube
dimensionality, is a matter of debate [3]. At least one author has
proposed that lower dimensionality hypercubes, say d < 5, are
preferable to higher dimension hypercubes such as the binary
hypercube. While we take no position on the matter here, we note
that it may be desirable to have some flexibility in the choice of
M, the number of network nodes, regardless of the value of d
chosen. For example if we choose d = 4 and M = 256, then m = 4
and the very next larger value of m, m = 5, would multiply the
size, and presumably the cost, of the system by over 244%, to 625
nodes. It may also be desirable, perhaps for physical packaging
reasons, to add nodes to a system on only one or a few dimensions.
For these reasons we broaden our discourse from hypercubic to 
hyper-rectangular structures. 


If we generalize our d-digit node designators to a mixed radix
number system, where the node positions in the i-th dimension,
$0 \le i \le d-1$, are in the range $0, \ldots, m_i - 1$, so that all
dimensions are not necessarily the same width, then the resulting
network is a hyper-rectangular network. (Our generalization differs from [1].)


Routing 


There are several obvious routing algorithms for m-ary d-cubes, 
and most of them also apply for hyper-rectangulars. We use the 
standard left-to-right routing algorithm, which solves the routing 
problem one dimension at a time, starting from the lowest declared 
dimension. 

One could also turn this order around and do right-to-left routing, 
or any fixed permutation of the dimension orders could be chosen 
as long as it is consistent across all nodes. Even random selection of 
a dimension is acceptable, the important factor being consistent 
routing by all nodes so that the routing is uniform across the network.
If non-uniform routing is employed, the expected channel message rates
may be affected by non-uniformities in the routing strategy, and the
analysis which follows may not apply.
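
The left-to-right rule can be sketched in a few lines of Python. This is only an illustration for the non-toroidal point-to-point case; the function name `route` and the representation of a node as a tuple of digits are assumptions made here, not notation from the paper.

```python
def route(src, dst):
    """Return the nodes visited from src to dst, correcting one dimension
    at a time, starting from dimension 0 (left-to-right routing)."""
    path = [tuple(src)]
    cur = list(src)
    for i in range(len(cur)):
        # Move along dimension i, one link per step, until digit i matches.
        step = 1 if dst[i] > cur[i] else -1
        while cur[i] != dst[i]:
            cur[i] += step
            path.append(tuple(cur))
    return path
```

For example, `route((0, 0), (2, 1))` first corrects dimension 0 and then dimension 1. With end-to-end channels the inner loop would collapse to a single hop per dimension.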


Notation 

    $d$ := number of dimensions, numbered $0, 1, \ldots, d-1$

    $m_i$ := number of nodes in dimension $i$, numbered $0, 1, \ldots, m_i - 1$

    $M := \prod_{i=0}^{d-1} m_i$ = total number of nodes.


A standard m-ary d-cube is the special case $m_i = m$ for all $i$.
Nodes are labeled according to their d-dimensional position in the
structure:

    $\vec{n} := (n_0, n_1, \ldots, n_{d-1}), \quad 0 \le n_i \le m_i - 1.$

An index $n$ for node $\vec{n}$ is

    $n = n(\vec{n}) := \sum_{k=0}^{d-1} n_k \prod_{i=k+1}^{d-1} m_i, \quad 0 \le n < M,$

and

    $\lambda_{\vec{n}}$ := message rate originating out from node $\vec{n}$.
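
The mixed-radix node index can be computed with Horner's rule. The sketch below is illustrative only; the helper names `node_index` and `node_position` are assumptions introduced here, with `m` the tuple of dimension widths and `n` the tuple of node digits.

```python
def node_index(n, m):
    """Index of node n = (n_0, ..., n_{d-1}) in a hyper-rectangular
    with widths m = (m_0, ..., m_{d-1}); digit 0 is most significant."""
    idx = 0
    for digit, width in zip(n, m):
        if not 0 <= digit < width:
            raise ValueError("digit out of range for its dimension")
        idx = idx * width + digit
    return idx


def node_position(idx, m):
    """Inverse of node_index: recover the digit tuple from the index."""
    digits = []
    for width in reversed(m):
        idx, digit = divmod(idx, width)
        digits.append(digit)
    return tuple(reversed(digits))
```

With widths (2, 3, 4) the last node (1, 2, 3) receives index M - 1 = 23.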


2. Assumptions 


The throughput analysis below is based on the following assump- 
tions. Later on we will introduce additional assumptions that are 
required for computation of average queue lengths and response 
times. 


Assumption A: Each node sends messages to the network at the
              same average rate, $\lambda$.


Assumption B: For every message, the target node is selected
              from the uniform distribution over all nodes:

              Prob { node $\vec{n}$ is target } = $1/M$.


Assumption C: Message routing is done one dimension at a time,
              in any order, as long as all nodes use the same
              method, as discussed earlier. For example, the
              standard left-to-right routing algorithm can be
              used.


Assumption D: The system is assumed to be in a steady state.
              This implies that flow balance [4] applies at
              every server (node or link), i.e., the message rate
              into a server equals the message rate out of it. It
              also implies that the network is not saturated.


3. End-to-end Channels 


In this section we assume that all nodes on any one dimensional
line are connected to the same channel. Technically the channel
could be implemented as a bus or an ethernet, for example. For
any dimension $i$, there are now

    $m_0 \cdot m_1 \cdots m_{i-1} \cdot m_{i+1} \cdots m_{d-1} = M / m_i$

channels. The channels in dimension $i$ are named

    $c_{ik}, \quad 0 \le k < M/m_i,$

in the order of increasing smallest node index in the channel. The
set of nodes on the channel $c_{ik}$ is denoted with the same symbol,
$c_{ik}$; it will be clear from context whether $c_{ik}$ denotes a channel or
the nodes on it. We denote

    $r_{ik}$ := avg. message rate (over time) through $c_{ik}$, $\forall k, \ 0 \le k < M/m_i$.


Consider how the channels in dimension $i$ are used. There are M
nodes in the whole system. Each node has message origination rate
$\lambda$, and so

    $M\lambda$ = total message origination rate.


It is easy to show that if the hyper-rectangular inter-connection
network is implemented with end-to-end channels in each dimension
and the assumptions A, B, and C apply, then, given any channel
$c_{ik}$ in dimension $i$, the message rate on it is

    $r_{ik} = (m_i - 1)\lambda, \quad \forall i,k, \ 0 \le i < d, \ 0 \le k < M/m_i.$

Every channel in any dimension $i$ has the same traffic density, and
the average rate (over time) through any channel in any dimension
$i$ is $(m_i - 1)\lambda$.
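
The channel rate can be checked by brute force on a small network. The following sketch is not part of the original analysis: it enumerates every source/destination pair under Assumptions A and B (each pair carries rate $\lambda/M$, the sender included as a possible target), applies left-to-right routing, and accumulates the rate on each channel. The function name `channel_rates` and the way a channel is keyed (dimension plus the coordinates in all other dimensions) are assumptions made for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def channel_rates(m, lam=Fraction(1)):
    """Message rate on every end-to-end channel of a hyper-rectangular
    with widths m, under uniform traffic and left-to-right routing."""
    nodes = list(product(*(range(w) for w in m)))
    M = len(nodes)
    rates = Counter()
    for src in nodes:
        for dst in nodes:
            cur = list(src)
            for i in range(len(m)):
                if cur[i] != dst[i]:
                    # A channel is identified by its dimension and by the
                    # coordinates shared by all nodes attached to it.
                    key = (i, tuple(cur[:i] + cur[i + 1:]))
                    rates[key] += lam / M
                    cur[i] = dst[i]
    return rates
```

For widths (2, 3, 4) every dimension-i channel should come out at exactly $(m_i - 1)\lambda$, and there should be $M/m_i$ channels per dimension.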


Node Traffic 


Let $\vec{n}$ be any node in the system. Consider the message rate into
$\vec{n}$ from an arbitrary adjacent channel $c_{ik}$. The probability of a
message in $c_{ik}$ arriving from some node other than $\vec{n}$ is
$(m_i - 1)/m_i$, and the probability of such a message being targeted to
$\vec{n}$ is $1/(m_i - 1)$. So the overall message rate from $c_{ik}$ to $\vec{n}$ is

    $\frac{m_i - 1}{m_i} \cdot \frac{1}{m_i - 1} \cdot r_{ik} = \frac{r_{ik}}{m_i} = \frac{m_i - 1}{m_i}\,\lambda.$


The total message rate into $\vec{n}$, and because of Assumption D, the
total message rate through $\vec{n}$, is thus

    $r_{\vec{n}} = \sum_{i=0}^{d-1} \frac{m_i - 1}{m_i}\,\lambda = d\left(1 - \frac{1}{\tilde{m}}\right)\lambda < d\lambda,$

where $\tilde{m}$ is the harmonic mean,

    $\tilde{m} = \frac{d}{\frac{1}{m_0} + \cdots + \frac{1}{m_{d-1}}}.$

In an end-to-end network, the message rate on each channel is
relatively large, $(m_i - 1)\lambda$, but the node traffic rate is dependent only on
the number of dimensions in the network. For example, for m-ary
d-cubes the channel traffic is O(m), whereas the node traffic is
O(d).
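
As a worked illustration (the 2 x 3 x 6 network is chosen here only because the numbers come out evenly), the node rate can be evaluated both directly and through the harmonic mean $\tilde{m} = d / \sum_i (1/m_i)$:

```latex
% Worked example, d = 3, (m_0, m_1, m_2) = (2, 3, 6)
\begin{align*}
\sum_{i=0}^{2} \frac{m_i - 1}{m_i}\,\lambda
  &= \left(\tfrac{1}{2} + \tfrac{2}{3} + \tfrac{5}{6}\right)\lambda = 2\lambda, \\
\tilde{m} &= \frac{3}{\tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{6}} = 3, \\
d\left(1 - \frac{1}{\tilde{m}}\right)\lambda
  &= 3 \cdot \tfrac{2}{3}\,\lambda = 2\lambda < d\lambda = 3\lambda .
\end{align*}
```

The channel rates in the same network are $\lambda$, $2\lambda$ and $5\lambda$, so the busiest channel carries more traffic than any node.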


4. Point-to-Point Channels 


Consider now a point-to-point inter-connection network, where
adjacent nodes (node addresses differing in one dimension by one)
are connected via some type of dedicated link. Adjacent links in
the same direction are thought to compose an end-to-end channel,
which is denoted as $c_{ik}$, just as in the earlier case. There are $m_i$
nodes on that channel,

    $\vec{n}_{ikj}, \quad \forall j, \ 0 \le j \le m_i - 1,$

and $m_i - 1$ links,

    $c_{ikj}, \quad \forall j, \ 1 \le j \le m_i - 1,$

between them. The nodes and links are numbered as shown in Figure 1.


Figure 1: Node and link indexes across an end-to-end channel


Consider any end-to-end channel, $c_{ik}$, in a hyper-rectangular network.
If Assumptions A, B, and C apply, then all possible paths
through $c_{ik}$ are equally likely, each with probability

    $\frac{1}{m_i (m_i - 1)}.$


If $c_{ik}$ is any channel in dimension $i$ in a hyper-rectangular network,
and $c_{ikj}$ is any link on it, $1 \le j \le m_i - 1$, and if Assumptions
A, B, and C apply, then

    $P_{j|ik} := P\{c_{ikj} \text{ used} \mid c_{ik} \text{ used}\} = \frac{2j(m_i - j)}{m_i(m_i - 1)}, \quad 1 \le j \le m_i - 1.$


Now, if a hyper-rectangular inter-connection network is implemented
with point-to-point links, and the assumptions A, B, and C
apply, then the message rate on each link, $c_{ikj}$, is

    $r_{ikj} = 2j\left(1 - \frac{j}{m_i}\right)\lambda, \quad \forall i,k,j, \ 0 \le i < d, \ 0 \le k < M/m_i, \ 1 \le j < m_i.$

The above equation shows that, for example, the message rate on
any link in the dimension $i$ depends only on its distance from the
dimension $i$ edge in the hyper-rectangular and not on distances from
other edges of the network.

The message rate for any dimension $i$ link is largest in the middle
of a channel, i.e., when the distance from the dimension $i$ edge is
the largest. If the busiest link in dimension $i$ is denoted as
$c_{ik\hat{j}}$, then the corresponding message rate on it is

    $r_{ik\hat{j}} = r_{ik(m_i/2)} = \frac{m_i}{2}\,\lambda,$  if $m_i$ is even,

    $r_{ik\hat{j}} = r_{ik((m_i-1)/2)} = \left[\frac{m_i}{2} - \frac{1}{2m_i}\right]\lambda,$  if $m_i$ is odd.
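
The link rate can be reproduced by counting paths on one line of nodes. The sketch below is illustrative; it assumes, since Figure 1 is not reproduced here, that link j joins nodes j-1 and j, and the function name `line_link_rates` is hypothetical. Each of the $m_i(m_i - 1)$ ordered source/destination pairs on the channel is taken as equally likely, and the channel as a whole carries $(m_i - 1)\lambda$.

```python
from fractions import Fraction


def line_link_rates(m_i, lam=Fraction(1)):
    """Rate on each link j = 1..m_i-1 of a non-toroidal channel with m_i
    nodes, assuming link j joins nodes j-1 and j."""
    channel_rate = (m_i - 1) * lam
    n_paths = m_i * (m_i - 1)
    rates = {}
    for j in range(1, m_i):
        # A path s -> t crosses link j when s and t lie on opposite sides.
        crossing = sum(
            1
            for s in range(m_i)
            for t in range(m_i)
            if s != t and min(s, t) < j <= max(s, t)
        )
        rates[j] = channel_rate * Fraction(crossing, n_paths)
    return rates
```

The counts agree with $2j(1 - j/m_i)\lambda$, and the maximum sits at the middle link, e.g. $3\lambda$ for $m_i = 6$.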


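Node Traffic

Let $\vec{n}$ be any node. The traffic into $\vec{n}$, $r_{\vec{n}}^{in}$, consists of two parts:
traffic destined for $\vec{n}$, $r_{\vec{n}}^{dest}$, and traffic passing through $\vec{n}$ on its
way elsewhere, $r_{\vec{n}}^{thru}$, or

    $r_{\vec{n}}^{in} = r_{\vec{n}}^{dest} + r_{\vec{n}}^{thru}.$


Similarly the traffic out of $\vec{n}$, $r_{\vec{n}}^{out}$, consists of traffic originating at
$\vec{n}$, $r_{\vec{n}}^{orig}$, plus the same through traffic, $r_{\vec{n}}^{thru}$,

    $r_{\vec{n}}^{out} = r_{\vec{n}}^{orig} + r_{\vec{n}}^{thru}.$


In steady state operation (Assumption D) the rates must have the
obvious balance:

    $r_{\vec{n}} := r_{\vec{n}}^{in} = r_{\vec{n}}^{out}, \qquad r_{\vec{n}}^{dest} = r_{\vec{n}}^{orig}.$


We call $r_{\vec{n}}$ the message rate at node $\vec{n}$.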


It can be shown that if the hyper-rectangular interconnection network
is implemented with point-to-point (non-toroidal) links, and if
Assumptions A, B, C, and D apply, then the nodal message rate,
$r_{\vec{n}}$, for any node $\vec{n}$ is bounded by:


aS ne 


#=0 Mm; 


where $\tilde{m}$ is the previously defined harmonic mean and $\bar{m}$ is the
arithmetic mean,

    $\bar{m} = \frac{1}{d} \sum_{i=0}^{d-1} m_i.$


For low-dimension hyper-rectangulars this means ry, can vary 
between approximately d> and =), a very broad range. This 


makes homogeneous nodes wasteful due to the tapering loads near 
the perimeter links. This can be corrected by going to toroidal 
point-to-point constructions which we consider in a moment. 


For binary hypercubes, where m = 2 and $d = \log_2 M$, the expected
value for $r_{\vec{n}}$ is exactly $\lambda d/2$ and all nodes are equally loaded, since
a bidirectional end-to-end channel with only two nodes on it is
equivalent to a circular (toroidal) channel organization.


If we replace each bi-directional link, $c_{ikj}$, in the hyper-rectangular
with two uni-directional links, $c_{ikj}^{+}$ and $c_{ikj}^{-}$, then the link message
rates are halved,

    $r_{ikj}^{+} = r_{ikj}^{-} = j\left(1 - \frac{j}{m_i}\right)\lambda, \quad \forall i,k,j, \ 0 \le i < d, \ 0 \le k < M/m_i, \ 1 \le j < m_i,$

because the message rate on any link is the same in each direction.
However, the node message rates remain the same, 


d-1 
=a > are 


+ =0 


5. Toroids 


Suppose now that, for each dimension $i$ and for each node row in that
dimension, there is an additional link from the last node ($m_i - 1$) to
the first node (0) in that row. Further assume that each link is
now uni-directional, with messages routed only in the index order
$0 \to 1 \to \cdots \to m_i - 1 \to 0$. Such circular structures are generally
called toroids, and we call the ones described here uni-directional
circular hyper-rectangulars. All the assumptions stated
before (Assumptions A, B, C, and D) still apply.


We now derive the message throughput on individual links. Select
any dimension $i$ channel $c_{ik}$ and consider the message rate
through the links, $c_{ikj}$, on it. The total message rate through all of
the dimension $i$ channels is

    $r_i^{total} = M\lambda\,\frac{m_i - 1}{m_i}.$

Because of the homogeneous structure of the toroid, all channels on
dimension $i$ are equally busy, and there are $M/m_i$ channels for
dimension $i$. Thus, the total message rate on $c_{ik}$, $r_{ik}$, is

    $r_{ik} = \frac{r_i^{total}}{M/m_i} = (m_i - 1)\lambda.$


Let $j$ be any link on $c_{ik}$. There are $m_i(m_i - 1)$ different routes
through $c_{ik}$, and

    $(m_i - 1) + (m_i - 2) + \cdots + 2 + 1 = \frac{m_i(m_i - 1)}{2}$

of them go through $j$. So, the probability that a message using $c_{ik}$
goes through $c_{ikj}$, $P_{j|ik}$, is

    $P_{j|ik} = \frac{m_i(m_i - 1)/2}{m_i(m_i - 1)} = \frac{1}{2}.$

Now, the message rate on the given link $c_{ikj}$ is

    $r_{ikj} = P_{j|ik}\, r_{ik} = \frac{m_i - 1}{2}\,\lambda.$


Let $\vec{n}$ be any node on any channel $c_{ik}$ in dimension $i$, and let

    $j_i$ := dimension $i$ link leading to $\vec{n}$, $\forall i, \ 0 \le i < d$.


The message rate through $\vec{n}$ is the sum of the message rates on all
links leading to $\vec{n}$:

    $r_{\vec{n}} = \sum_{i=0}^{d-1} r_{ikj_i} = \sum_{i=0}^{d-1} \frac{\lambda}{2}(m_i - 1) = \frac{\lambda d}{2}(\bar{m} - 1).$
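
The probability 1/2 can be confirmed by enumerating the routes on one uni-directional ring. This is an illustrative sketch only; the function names are hypothetical, and the link examined is the one leaving node 0 (by symmetry any link gives the same count).

```python
from fractions import Fraction


def ring_link_probability(m_i):
    """Fraction of the m_i*(m_i-1) routes on a uni-directional ring that
    traverse the link leaving node 0."""
    n_paths = m_i * (m_i - 1)
    through = 0
    for s in range(m_i):
        for t in range(m_i):
            if s == t:
                continue
            hops = (t - s) % m_i
            # The route leaves nodes s, s+1, ..., s+hops-1 (mod m_i) in turn.
            if any((s + h) % m_i == 0 for h in range(hops)):
                through += 1
    return Fraction(through, n_paths)


def uni_toroid_node_rate(m, lam=Fraction(1)):
    """Node rate: sum over dimensions of the rate on the incoming link."""
    return sum(ring_link_probability(w) * (w - 1) * lam for w in m)
```

For a 3 x 5 network the node rate evaluates to $3\lambda$, in agreement with $(\lambda d/2)(\bar{m} - 1)$ for $d = 2$, $\bar{m} = 4$.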


Bi-Directional Toroids 


Another possibility is to use bi-directional links to connect the
nodes in the hyper-rectangular toroid. Assumption C requires a
balanced routing algorithm. The deductions below are made based
on the assumption that, if two paths of equal lengths exist from
node $\vec{n}_1$ to node $\vec{n}_2$, then either one of them is selected with
probability 1/2. If some other balanced method is selected, similar
deductions can still be made.


Let $j$ be any link, between nodes $\vec{n}_1$ and $\vec{n}_2$, on any channel $c_{ik}$.
The message rate through it is now

    $r_{ikj} = (m_i - 1)\lambda$              if $m_i = 2$,

    $r_{ikj} = \frac{m_i}{4}\,\lambda$               if $m_i$ even, $m_i > 2$,

    $r_{ikj} = \frac{m_i^2 - 1}{4m_i}\,\lambda$           if $m_i$ odd, $m_i > 2$.


The message rate through any node $\vec{n}$ is again half of the sum of the
message rates of adjacent links,

    $r_{\vec{n}} = \frac{1}{2} \sum_{c_{ikj} \text{ adjacent to } \vec{n}} r_{ikj} = \sum_{i=0}^{d-1} \left[ r_{ikj_i} \cdot (1 - 0.5 \cdot 1(m_i = 2)) \right],$

where $j_i$ is an index for a dimension $i$ link adjacent to $\vec{n}$, and

    $1(X) = 1$ if X is true, and 0 otherwise.
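
The three cases for the bi-directional link rate can be checked by enumeration. The sketch below is an illustration under the stated tie rule (equal-length paths each taken with probability 1/2); the function name is hypothetical, and the link examined is the one between nodes 0 and 1, counting both directions.

```python
from fractions import Fraction


def bidir_ring_link_rate(m_i, lam=Fraction(1)):
    """Rate on the link between nodes 0 and 1 of a bi-directional ring,
    with shortest-path routing and ties split evenly."""
    if m_i == 2:
        # Two nodes share a single link, which carries the whole channel.
        return (m_i - 1) * lam
    weight = Fraction(0)
    for s in range(m_i):
        for t in range(m_i):
            if s == t:
                continue
            up = (t - s) % m_i        # hops in increasing index order
            down = m_i - up           # hops in decreasing index order
            uses_up = any((s + h) % m_i == 0 for h in range(up))
            uses_down = any((s - h) % m_i == 1 for h in range(down))
            if up < down:
                weight += uses_up
            elif down < up:
                weight += uses_down
            else:
                weight += Fraction(uses_up + uses_down, 2)
    # Each ordered pair on the channel carries (m_i-1)*lam / (m_i*(m_i-1)).
    return weight * lam / m_i
```

The results match $m_i\lambda/4$ for even widths and $(m_i^2 - 1)\lambda/(4m_i)$ for odd widths.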


6. Queue Lengths and Response Times 


The total hyper-rectangular network has an arrival rate of $M\lambda$.
We have already derived the overall message rates on individual
nodes and links. To obtain the queue lengths and response times
we need more information about the system. We need the processing
speed for every device (node, bus, or link) in the system. Also, for
the analytic solution to be tractable, i.e., for the network to be
separable [7], we need two additional assumptions:


Assumption E: The system can be defined as a sequence of
              events that occur at distinct times.


Assumption F: The completion rate from a server does not
              depend on the load at other servers.


One additional assumption is needed for a simple solution to be
available:


Assumption G: The completion rate from a busy server must
              not depend on the queue length for that
              server.


We also need some new notation. Let

    $S_i$ := average service time at device $i$ per message (sec/msg).


For further analysis, we transform the message rates into visit
ratios, which define the average number of times that any message
routed through the system passes through a device. Given the
actual message rate, $r_i$, through a device (here a node, bus, or a
link), and the total message rate arriving to (originating at) the
system, $M\lambda$, the corresponding visit ratio at device (server) $i$ is

    $V_i = \frac{r_i}{M\lambda}$   (visits/msg).

We can now use well known operational analysis theory for open
queueing networks ([7]) and compute the queue lengths and
response times for every device $i$ in the system.


The average work demand per message for device $i$ is

    $D_i = V_i S_i$   (sec/msg),

and the processing capacity of the system is determined by the
device with the largest demand,

    $D_{max} := \max_i D_i.$


The maximum system throughput, i.e., the network capacity, is

    $C = \frac{1}{D_{max}},$

and thus, we must have

    $M\lambda \le \frac{1}{D_{max}}$, i.e., $\lambda \le \frac{1}{M D_{max}},$

to avoid saturating the network. If $\lambda > 1/(M D_{max})$, Assumption D is
violated, the system becomes saturated, and queue lengths and
response times "explode" to infinity.
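
The chain from message rates to the saturation bound can be sketched as follows. The function name `operational_bounds` and the dictionary-based interface are assumptions made here for illustration; the numbers in the usage note (service time 1/100 s, a 4-ary 2-cube) are made up.

```python
from fractions import Fraction


def operational_bounds(rates, service_times, M, lam):
    """Visit ratios V_i, demands D_i, network capacity 1/D_max and the
    largest per-node rate lambda that keeps the network unsaturated."""
    total_rate = M * lam
    visits = {dev: r / total_rate for dev, r in rates.items()}
    demands = {dev: visits[dev] * service_times[dev] for dev in rates}
    d_max = max(demands.values())
    return visits, demands, 1 / d_max, 1 / (M * d_max)
```

For a uni-directional toroidal 4-ary 2-cube (M = 16) every link carries $(m-1)\lambda/2 = 1.5\lambda$; with a link service time of 1/100 s the bound on $\lambda$ is 200/3 messages per second, which is exactly the point where the link rate reaches the link capacity of 100 messages per second.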


In general, to compute the total average system response time, we 
need to consider all nodes, busses, and/or links. However, node 
delays can often be ignored in practice, because they are often 
included in the link service times. 


If two uni-directional links replaced each bi-directional link, and 
the maximum message rate for the uni-directional link were half of 
that for bi-directional links, then the maximum visit ratio would be 
half of that before, but the average demand would be the same. 
Thus, all other performance measures given above would be the 
same as they were for the network with bi-directional links. 


One-Dimensional End-to-End Channels 


For hyper-rectangulars with end-to-end channels, the response
times and queue lengths can be expressed in a more condensed
form because

    $V_{c_{ik}} = \frac{m_i - 1}{M}$  for all channels $c_{ik}$ in dimension $i$, $\forall i, \ 0 \le i < d$.


The average total channel residence time (time spent in all
channels per average message) is

    $R^{links} = \sum_{i=0}^{d-1} \frac{1 - \frac{1}{m_i}}{\frac{1}{S^{link}} - (m_i - 1)\lambda}.$


Similarly, for each node, $\vec{n}$,

    $V_{\vec{n}} = \frac{d}{M}\left(1 - \frac{1}{\tilde{m}}\right),$

and the average total node residence time is

    $R^{nodes} = \frac{d\left(1 - \frac{1}{\tilde{m}}\right)}{\frac{1}{S^{node}} - \lambda\, d\left(1 - \frac{1}{\tilde{m}}\right)}.$


The average total message latency is $R^{links} + R^{nodes}$.


Uni-Directional Toroids


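For uni-directional toroids and for any link $c_{ikj}$ on any channel $c_{ik}$
in dimension $i$,

    $V_{c_{ikj}} = \frac{m_i - 1}{2M}$   and   $R^{links} = \sum_{i=0}^{d-1} \frac{(m_i - 1)/2}{\frac{1}{S^{link}} - \frac{(m_i - 1)\lambda}{2}},$

    $V_{\vec{n}} = \frac{1}{2M} \sum_{i=0}^{d-1} (m_i - 1) = \frac{d}{2M}(\bar{m} - 1),$

and

    $R^{nodes} = \frac{d(\bar{m} - 1)/2}{\frac{1}{S^{node}} - \frac{\lambda d}{2}(\bar{m} - 1)}.$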
The average total response time is $R^{links} + R^{nodes}$. Ignoring
$R^{nodes}$ for the moment and taking the network response time to be
$R^{links}$, we can get a simplified form for the case of m-ary d-cubes
(arbitrary radix hypercubes):

    $R^{links}_{hypercube} = \frac{d(m-1)/2}{\frac{1}{S^{link}} - \frac{(m-1)\lambda}{2}}.$

We can recognize the numerator as $\bar{h}$, the average number of link
traversals (hops) for a message in the network [3], the first term in
the denominator as $C^{link}$, the single link capacity in messages per
second, and the second term in the denominator as $r^{link}$, the single
link traffic rate ($r_{ikj}$ is the same everywhere in a uni-directional
toroidal hypercube under our assumptions). Thus

    $R^{links}_{hypercube} = \frac{\bar{h}}{C^{link} - r^{link}},$

indicating that the network response time (exclusive of node
service) in a hypercube is just the average message distance divided
by the idle link capacity (in messages per second). This result is
reported by earlier authors [5, p.272].


Note that the expression for the average number of hops (link
traversals) per message, $\bar{h}$, used above, is easily derivable as

    $\bar{h} = E[\text{link traversals}] = \sum_{i=0}^{d-1} \frac{1}{m} \sum_{j=0}^{m-1} j = \frac{d(m-1)}{2}.$

Note also that we may be justified in ignoring $R^{nodes}$ in our
analysis if the node is implemented so that traffic on all d dimensions
is handled in parallel within the node. The above expression
for $R^{nodes}$ assumes that a node acts as a single server device. If, in
fact, all of a node's ports to the network operate in parallel, then
the node service time, $S^{node}$, can just be treated as part of the link
service time, $S^{link}$, and $R^{nodes}$ then becomes zero.
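
The hypercube form can be cross-checked numerically against the device-by-device sum of residence times. The sketch below assumes the usual open-network residence time $V_i S_i / (1 - U_i)$ with utilization $U_i = M\lambda V_i S_i$ for each of the $Md$ links; the function names and the numerical values in the test are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def mean_hops(m, d):
    """Average link traversals per message in a uni-directional toroidal
    m-ary d-cube: (0 + 1 + ... + (m-1)) / m hops in each dimension."""
    return d * Fraction(sum(range(m)), m)


def link_response_time(m, d, lam, s_link):
    """R = h / (C_link - r_link), valid only below saturation."""
    c_link = 1 / s_link
    r_link = (m - 1) * lam / 2
    if r_link >= c_link:
        raise ValueError("network is saturated")
    return mean_hops(m, d) / (c_link - r_link)


def summed_link_residence(m, d, lam, s_link):
    """Same quantity obtained by summing V*S/(1-U) over all M*d links."""
    M = m ** d
    visit = Fraction(m - 1, 2 * M)
    util = M * lam * visit * s_link
    return M * d * visit * s_link / (1 - util)
```

For m = 4, d = 2, ten messages per second per node and a link service time of 1/100 s, both routes give 3/85 s.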


Bi-Directional Toroids 


For bi-directional toroids and for any link $c_{ikj}$ on any channel $c_{ik}$
in dimension $i$,

    $V_{c_{ikj}} = \frac{1}{M}$                 if $m_i = 2$,

    $V_{c_{ikj}} = \frac{m_i}{4M}$               if $m_i$ even, $m_i > 2$,

    $V_{c_{ikj}} = \frac{m_i^2 - 1}{4M m_i}$           if $m_i$ odd, $m_i > 2$,

and

    $V_{\vec{n}} = \sum_{i=0}^{d-1} \left[ V_{c_{ikj}} \cdot (1 - 0.5 \cdot 1(m_i = 2)) \right],$

where $c_{ikj}$ is any link on any channel $c_{ik}$ in dimension $i$. Overall
message latencies and queue lengths are then computed with the
standard formulae.


7. Conclusions 


We have derived simple formulae for channel and link throughput 
in end-to-end and point-to-point networks under generally applica- 
ble assumptions. The analytical link and node throughputs are 
summarized in Table 1. 


Queueing delays within the network will slow down any individual 
message, but they do not affect the message rates. Queueing 
behavior cannot be anticipated with operational analysis alone; sto- 
chastic assumptions are needed. For separable open networks we 
have relatively simple closed form solutions for queue lengths and 
response times in the system. 


An important part of the analysis was the decision to include the 
sending node as a possible target node, so that every node was 
equally likely to receive every message. This simplified the analysis 
in many places. If one rules out messages to the sending node, and
denotes the actual message rate out from each node as $\beta$, then all
the formulae given earlier apply for
M+1 
= Vi B. 


As a special case, we can use the formulae described earlier to 
derive link and node throughput for m-ary d-cubes and hypercubes. 
The derived formulae are given in Tables 2 and 3. Also, the 
response times and device queue lengths reduce to simpler forms 
for these special cases. For example, when $m_i = 2$, the link queue
lengths and link residence times become


Glink y link d 
aia TS er =F 
2 Glink 


Remark: This paper is a shortened version of a technical report [6], 
which contains proofs for all the results presented here in addition 
to examples, supporting simulation results, and additional discus- 
sion. 
Table 1: Hyper-Rectangular Throughput

Message rate out from each node: $\lambda$
$m_0 \cdot m_1 \cdots m_{d-1}$ Hyper-Rectangular
$j$ is link index from edge

Connection Topology     Link Rate     Node Rate
one-dimensional bus
node-to-node
uni-dir. toroid
bi-dir. toroid


Table 2: Throughput in m-ary d-cubes

Message rate out from each node: $\lambda$;  $m_i = m$;  $M = m^d$

Connection Topology     Link Rate     Node Rate
one-dimensional bus
node-to-node
uni-dir. toroid
bi-dir. toroid


Message rate out from each node: $\lambda$;  $m_i = 2$;  $M = 2^d$

Connection Topology     Link Rate     Node Rate
uni-dir. toroid
Abstract


A method for distributed termination detection is proposed that naturally 
fits the structure of an array of mesh-connected processing elements. Two 
sufficient conditions are given which guarantee that any one of the 
processing elements may detect the termination of computation on the 
mesh. The method is fully distributed, symmetric, asynchronous, and 
efficient in that it combines termination detection with the computation 
process and it does not require any global information transmission until 
termination of computations has been detected. The method was originated 
for use with parallel processing for finite element analysis on a mesh of 
processing elements, but it is applicable to any asynchronous iterative 
computations on the mesh. The method can also be used for termination 
detection when execution of successive tasks are overlapped on the mesh. 


1. Introduction 


Computation of the finite element analysis can be distributed on an array 
of processing elements (PEs). The PEs are usually connected as a mesh 
since a mesh is a good match to the grid patterns used in the finite 
element analysis. The computation can be carried out by either direct or 
iterative solution techniques. Iterative solutions have been found more 
suitable for utilizing the power of parallel processing [1] [2]. Iterative 
methods can be either synchronous or asynchronous. Asynchronous 
iterations have been attracting more attention [3] [4]. Being asynchronous, 
the computation has all the attributes of other distributed computations: a
PE is either active in doing its computation or passive when it is done
with its computation; passive PEs may be activated again by messages
from active PEs; and the pattern of message transmission cannot be
decided a priori. One of the challenges in using iterative solutions is to
determine when the computation is completed. The solution to this
problem can be a centralized or distributed one. An ideal solution should
have the following properties:


(1) It does not interfere with the computation process (transparency).

(2) It does not require dedicated communication channels.

(3) It does not use a predesignated processor (host, root etc.) that
    observes the states of all the PEs, i.e., the solution should be
    fully distributed and symmetric.

(4) Message transmission may be delayed in communication
    channels, i.e., the transmission is not instantaneous.


In [1], [2] and [4] termination detection was solved by appealing to a 
global synchronization mechanism. When a PE is finished with its
current computation, it will report its state to a predesignated PE (called
the host or root). The host collects the states of all the PEs and decides if
the computation is terminated. Global termination detection is hard to 
implement in case of asynchronous iteration, since a passive PE may be 
activated again by a message from other PEs and this change in PE status 
must be made known to the host. The host may never know the real 
status of the computation due to communication delays unless message 
transmission is assumed instantaneous. Global synchronization may also 
take much extra time since global communication is usually slower than 
local communication; hence, PEs must wait for synchronization. 


Techniques for distributed termination must be used when global 
synchronization is not a good choice. The problem of distributed 
termination has been discussed in the literature [5] [6] [7] [8] [9]. The 
previous approaches have the following characteristics. 


(1) A dedicated communication network, CN, is assumed for the
    purpose of termination detection (tree in [5] [9]; ring in [6] [7]).

(2) Termination detection is a continuous, trial and error process. A
    detecting probe (token, control message) or detecting wave is
    initiated and circulated periodically in CN until the probe has
    detected the system termination.

(3) All algorithms, except the one in [7], use a predesignated
    processor to detect termination. While the algorithm in [7] is
    distributed and symmetric in the sense that any processor may
    detect termination, it uses a common clock which is not
    desirable in practice.


There are drawbacks in these approaches. First, CN can not be used for 
transmitting a computation message (data message). Otherwise, the speed 
of termination detection and computation would both be reduced. There 
must be another network for computation messages; thus, CN adds 
hardware and complexity to a parallel computer system. The second 
problem with these methods is the large number of messages travelling in 
CN. Some messages that will eventually be destroyed must pass through 
several processors even though they have become obsolete at the 
beginning of the journey [7]. The problems are inherent in the ring 
structure and the assumption that data messages can be passed between 
any processor pair. 


The situation is different in a mesh-connected array of processors. 
Communication in the mesh is through nearest-neighbor connections 
which are fixed and local. Only local messages between adjacent PEs 
exist. Taking advantage of the mesh structure, a token (or probe) passing 
method is proposed for distributed termination of computation on a 
rectangular mesh. When finished with its computation, each PE sends its 
termination state to its neighbor PEs. One token from each side of the 
rectangular mesh is initiated once. The tokens will be travelling in the 
mesh according to rules derived below. Their traces and positions will 
indicate the computation status of the mesh. One or more of the PEs will 
eventually be able to detect termination of the computation by examining 
its own record of the token arrival, its state, and its neighbor's states. The 
method is fully distributed and symmetric in the sense that no PE has 
more responsibility than the others [7]. 


The following paragraphs describe the proposed method. Some 
assumptions about the rectangular mesh and PEs as well as some 
definitions are given in Section 2. The case where there is no data 
message (values for actual computation) is considered in Sec. 3. In this 
case a PE will keep passive once it is in passive state, which makes it 
easier to detect termination. The method is improved further in Section 4 
so that it works even if data messages exist. Correctness proofs, time 
analysis and discussions are given in Sections 5 and 6. 


2. Assumptions and Definitions 


The definitions of array, states, arrows, tokens, data messages and control 
messages are given in this section. Some assumptions about the mesh are 
also listed. 


As shown in Fig. 1, the array consists of mesh-connected processing 
elements (PEs) with each PE connected to its four nearest neighbors. A 
PE may be in one of two states: "active" and "passive". A PE is active 
when it is contributing to the computation and passive when it is finished 
with its assigned computation. The two states are indicated by circles and 
black dots, respectively. Token "North passive" indicates that all PEs
north of (not including) the west-east line where the token resides have
been passive. The definitions of the other tokens are similar. The tokens
are represented by black-and-white squares. They are messengers travelling
in the mesh to collect global information about the state of the mesh.
In total, there are four tokens, one for each side of the mesh. The state of
passiveness of a PE can be passed to other PEs. This message is
represented by arrows pointing from the sender PE to the receiver PE. An
arrow is a passive state messenger. Messages are grouped under data and
control. Data messages are those that carry values used in computation.
Control messages carry information on the state of a PE, e.g., an arrow or
a token. A passive PE may be activated again by data messages from
other PEs. A passive PE may communicate with other PEs by control
messages to decide the status of the mesh as a whole.


Fig. 1 Structure and notations. (Legend: passive PE, active PE; tokens
South passive, North passive, West passive, East passive.)


Computation is completed when all PEs are passive and there is no data 
message in any communication channel. Detection of this state by one of
the PEs is called the distributed termination problem.


One assumption is that message transmission is instantaneous, which 
was an assumption made in all previous papers on distributed termination 
known to us. Another assumption is that there is a continuous 
communication process that handles messages between PEs no matter 
whether computation is in process or not. 


To simplify the discussion, some combinations of the arrow reception 
patterns are named as shown below. 


[Figure: the four arrow reception patterns, labelled Single, Normal,
Collision, and Cross]


Single and Collision are the most important patterns. 


There are two ways in which tokens may be passed. One is called shift- 
pass and the other cross-pass as illustrated below: 


[Figure: shift-pass and cross-pass of a token at PE A]


For example, consider the south passive token which tends to travel to the 
north in the mesh. Assume that the token is impending on PE A. Shift- 
pass simply shifts the token along the horizontal line through PE A. This 
movement is caused by the single or normal reception patterns. Cross- 
pass sends the token across its horizontal line; this movement is caused 
by the collision reception pattern that is east-west oriented. 
3. A Distributed Termination Solution in the Absence of Data Messages


We describe a method for detecting, in distributed fashion, termination of
computations without data messages on the mesh. The absence of data
messages makes the detection problem simpler since a PE cannot be
activated again once it is passive. (This restriction is removed later in
Section 4 by a modification of the method.) The method is based on three
sets of rules: state transition rules, arrow passing rules, and token passing
rules. A PE will maintain a record of the arrival of tokens and arrows from
its neighbors. The principle behind the method is that a token will never
cross a line of PEs if any PE on this line has never been passive. A
token's location indicates that PEs on lines passed by the token have been
passive. The rules are listed below.


State Transition Rules: 


(0) Boundary PEs are always passive and ready to pass control 
messages (the arrows and tokens). 

(1) A PE is activated by initialization or program loading.

(2) A PE becomes passive if it is finished with the assigned 
computation. 


Arrow Passing Rules: 


An active PE does not pass any arrow. A passive PE does not pass arrows 
before receiving arrows from its neighbors. 


(0) A boundary PE passes its arrow to all its neighbors as soon as 
the computation starts. 

(1) A PE passes its own arrows in the opposite directions of the 
arrows it has received. There may be two cases as depicted 
below. 


Case 1: Single state (arrow received, arrow passed)

Case 2: Normal state (arrows received, arrows passed)

[Diagrams of the two cases]


(2) A PE passes its arrow to both neighbors if it receives two 
opposite arrows (Collision) from its neighbors. 


Rule (0) ensures that the termination detection process starts. Rule (1)
makes the state of passiveness propagate along a line. Rule (2) sends the
line passiveness information to all the PEs on the same line. This
information is needed to decide whether a token should be cross-passed.


Token Passing Rules: 


There are four different tokens residing initially on the four boundaries of
the mesh. The initial locations of the tokens are not important as long as
they are on their corresponding boundary sides. A passive PE may deliver
tokens to other PEs when it receives tokens. The following is a list of
rules for token passing.


(0) Boundary PEs cross-pass their tokens as soon as the computation
starts.

(1) A PE in Single or Normal states shift-passes the tokens it has 
received in the direction of its arrow. 

(2) A PE in Collision or Cross states cross-passes the tokens it 
receives. 


(3) A boundary PE keeps any tokens it has received. 


Each PE records the arrival of every token that visits it, so that
termination of the computation can be detected. Termination may be
detected by any PE. The following are two sufficient terminating
conditions:


CONDITION 1 Any PE can declare that the computation is terminated 
when the PE has a record of the arrival of all four tokens. 
The tokens may or may not be with the PE at the time 
of the decision. 


CONDITION 2 A boundary PE can declare that the computation is 
terminated when it receives one token since this token 
must be from the side opposite to the PE's side. 


The correctness of the conditions can be explained using the definitions of
the tokens. There are four sides on a rectangular mesh. Each token
indicates that its corresponding side is passive. The whole mesh is
passive once all four tokens meet at one PE, i.e., once all four sides
are passive. The first condition will occur if there is one and only one
Cross on any line of the mesh and there is no Collision. The second
condition will occur when there is no Cross or more than one Cross on
any line of the mesh.


An alternative technique is to use one token and let it travel from one side 
to the opposite side of the mesh. This will be sufficient for termination 
detection according to Condition 2. Detection time may be shorter if four 
tokens and Condition 1 are used. 


4. A Solution with the Existence of Data Messages 


When data messages exist, a passive PE may be activated by data 
messages from another PE. One fact is that a PE that sends a data 
message destroys the arrow from a passive PE that will receive the 
message. Termination can also be decided when data messages are present 
if another state transition rule is added: 


State Transition Rule (3): 


(3) If a PE sends a data message to a passive PE, the sender may not
declare itself passive until the receiver becomes passive again.


This rule guarantees that the token indicating that the receiver was passive
will never be cross-passed to the next level by the sender or any PEs on
the same line as the sender unless the receiver becomes passive again.
Now we can use all the rules and the two sufficient conditions listed in
Section 3 to detect termination when data messages exist.


5. Correctness Proof and Primitive Time Analysis 


We prove that Conditions 1 and 2 in Section 3 are sufficient even in the
presence of data messages. The proof of the first condition reads as follows.
Assume that a passive PE receives four tokens while the system is not
terminated. There must exist one originally active PE (i.e., one that has
never been passive before) according to state transition rule (3) specified
in Section 4. But then the tokens should not have crossed the two lines on
which this PE resides, which implies that no PE could have received four
tokens. This contradicts the assumption that the passive PE has received
four tokens. The proof of Condition 2 follows the same path. After any token
has crossed the mesh from one side to the opposite side, there would exist
no originally active PEs, which implies that every PE has been passive
and the system is terminated.


Consider a square mesh of n PEs ($\sqrt{n}$ on each side). Assuming that the
system is already terminated, the worst-case time for the algorithm to detect
the termination is O(n), which is the time for one token to travel through
a ring that connects every PE. The best time is $O(\sqrt{n})$, which is the
time for a token to travel across the mesh or for four tokens to meet at the
center of the mesh. These estimates assume that detection starts only after
the computation has terminated, which is the worst-case situation. Since
the token passing is performed in parallel with actual computations, the
execution of our termination algorithm may be completely overlapped with
actual computations so that termination detection could be completed as soon
as computation is over. In comparison, the algorithm developed in [7] will
always take O(n) time, where n is the number of PEs on a ring, since it is the
last token (the counter in the paper) that starts the terminating wave.


6. Discussion 


The proposed method has all the merits claimed by the other methods for
distributed termination detection: asynchrony, distribution, symmetry, etc.
It is also more efficient and faster than the ring-based algorithms, and it
makes better use of the mesh. It is efficient since no additional
communication network is needed; messages are passed only over the shortest
distance possible and only when they are needed, and global communication is
not required until the computation terminates. It is faster since messages
may travel in parallel with each other, unlike the sequential message
passing of the ring approach. It makes better use of the mesh since the
mesh structure is fully utilized.


The application of the method is not limited to finite element
analysis. It is useful for any distributed termination detection on an array
of mesh-connected processing elements. An immediate example is the
solution of nonlinear partial differential equations, where iterative
solution techniques are essential [10] [11]. The method is also applicable to
multi-task cases. Tokens, control messages, and data messages may be colored
to represent different tasks so that a few finite element analysis
computations may be running simultaneously. Another application of the
algorithm could be the solution of three-dimensional partial differential
equations.
ABSTRACT

We show that by exchanging any two independent edges in any
shortest cycle of the n-cube ($n \ge 3$), its diameter decreases by one unit.
This leads us to define a new class of n-regular graphs, denoted $TQ_n$, with
$2^n$ vertices and diameter $n-1$, which has the (n-1)-cube as a subgraph.
Other properties of $TQ_n$, such as connectivity and the lengths of the
disjoint paths, are also investigated. Moreover, we show that the complete
binary tree on $2^n - 1$ vertices, which is not a subgraph of the n-cube, is a
subgraph of $TQ_n$. Finally, we discuss how these results can be used to
enhance existing hypercube multiprocessors.


1. INTRODUCTION 

The possibility of interconnecting a number of processors together to 
solve very large problems in scientific computation has been extensively 
considered in the past [HwBr84]. Distributed-memory multiprocessor sys- 
tems have proven to be one of the most straightforward and the least 
expensive methods to build such arrays with hundreds or even thousands 
of processors [Seit85]. In such networks, each processor has its own 
memory and message passing is the means of information exchange 
between processors. 

It is well-known that the topology of the interconnection network
plays a significant role in system performance, especially for large scale
distributed-memory multiprocessors [SaSc85]. Several efforts on designing
interprocessor communication networks have been reported [WuLi81].
Among various architectural configurations, the point-to-point topology
has attracted a great deal of attention due to its simpler communication
protocols and direct communication paths among the nodes [HwGh87].
Several features have to be considered when evaluating a point-to-point
interconnection network. These features include the ability to embed other
problem topologies, the ability to meet the demands of massive parallelism,
the connectivity, the worst case communication delay between two
nodes, the tolerance of faulty components, the communication bandwidth
of each node, and the ease of routing between any two nodes.

Among point-to-point topologies, the hypercube has been a dominating
topology used in the first generation of distributed-memory
multiprocessors [ShFi88]. The strong connectivity of the hypercube and its
regularity, symmetry, and ability to embed many other topologies have made
it a powerful candidate for a wide class of applications [Foxg86]. Many other
interconnection topologies have been proposed for distributed-memory
multiprocessors, such as tree [DePa78], cube-connected cycle [PrVu81],
block-shuffle hypercube [HsYZ87], and hypernet [HwGh87]. These various
interconnection topologies have their own advantages and disadvantages
based on the above evaluation criteria. In this paper, we present the
least expensive approach to enhance the hypercube interconnection
scheme.

An n-dimensional hypercube multiprocessor consists of $N = 2^n$
processors interconnected as follows. Each processor is labeled by a different
n-bit binary number $(b_{n-1} b_{n-2} \cdots b_1 b_0)$. Two processors are
connected by a full duplex link if and only if their binary labels differ in
exactly one bit position. The popularity of hypercube multiprocessors is due
to their underlying topology, which is known as the n-cube graph $Q_n$. The
n-cube graph has been the subject of many research projects in recent years,
mainly because of the availability of hypercube multiprocessors [SaSc85]. As a
result, many properties of the n-cube have been discovered [BrSc85].
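The adjacency rule above (labels differing in exactly one bit) can be sketched in a few lines. This is only an illustration: the function names `hypercube_neighbors` and `hypercube_edges` are hypothetical, and labels are assumed to be stored as Python integers whose binary expansion is the n-bit label.

```python
def hypercube_neighbors(label, n):
    """Neighbors of `label` in the n-cube: flip each of the n bits in turn."""
    return [label ^ (1 << i) for i in range(n)]


def hypercube_edges(n):
    """All links of the n-cube, each listed once as (smaller, larger)."""
    edges = set()
    for a in range(1 << n):
        for b in hypercube_neighbors(a, n):
            if a < b:
                edges.add((a, b))
    return edges
```

For n = 4 this yields 16 processors with 4 links each, that is, 32 links in total.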

The rest of this paper is organized as follows. Our notation and
terminology are given in the next section. A new interconnection topology,
denoted $TQ_n$, which is based on a simple modification of the n-cube, is
given in Section 3. We will show in Sections 4 through 7 that $TQ_n$ has
certain topological advantages over $Q_n$. In particular, it is shown that the
diameter of $TQ_n$ is one less than that of $Q_n$, and its vertex-connectivity
is the same as that of $Q_n$. It is known that the complete binary tree on
$2^n - 1$ vertices, $T_n$, is not a subgraph of $Q_n$ [SaSc85]. However,
$T_{n-1}$ is contained in $Q_n$ [BrSc85]. We prove that $TQ_n$ has the
complete binary tree $T_n$ as a subgraph. Other subgraphs of $TQ_n$ are also
identified. Finally, practical implications of our results are given in
Section 8.


2. NOTATION AND TERMINOLOGY 

We will closely follow the graph theoretical terminology and notation
of [Hara72]; terms not defined here can be found in that book. Let
G(V,E) represent a graph with point or vertex set V(G) = V and edge set
E(G) = E. If an edge $e = uv \in E$ then vertices u and v are said to be
adjacent, the edge e is said to be incident to these vertices, and u and v are
the end points of edge e. Two edges are said to be independent if they do
not share an end point. For a vertex $v \in V$, I(v) represents the set of all
edges incident to v in G, and its cardinality |I(v)| is the degree deg(v) of
vertex v. We denote by $\delta(G)$ and $\Delta(G)$ the minimum and maximum
degrees respectively of vertices of G. If $\delta(G) = \Delta(G) = k$, then G
is said to be k-regular. For a set $X \subseteq E$ (or $X \subseteq V$), the
notation $G - X$ represents the graph obtained by removing the edges
(vertices) in X from G. The vertex-connectivity, $\kappa(G)$, of a graph G is
the least cardinality |X| of a set $X \subseteq V(G)$ such that $G - X$ is
either disconnected or consists of a single vertex.
The distance d(u,v) between two distinct vertices u and v is the
length (in number of edges) of a shortest path between these vertices. The
diameter d(G) of graph G is then defined to be
$d(G) = \max\{d(u,v) \mid u,v \in V\}$. If H and G are graphs then H is
isomorphic to a subgraph of G if there is a one-to-one function
$f: V(H) \to V(G)$ such that each edge $uv \in E(H)$ is carried to an edge
$f(u)f(v) \in E(G)$. By an abuse of language we will often merely say that
H is a subgraph of G (where in reality it is f(H) which is a subgraph of G)
and will write $H \subseteq G$.

Two specific graphs with which we will be concerned are complete
binary trees and n-cubes. As indicated before, $T_n$ will represent the
complete binary tree on $2^n - 1$ vertices. The root of $T_n$ is the unique
vertex whose degree is 2. If $Q_n$ is the n-cube then
$\delta(Q_n) = \Delta(Q_n) = n$, $d(Q_n) = n$, and $\kappa(Q_n) = n$. The
binary label of a vertex $v \in V(Q_n)$ will be referred to by an n-bit binary
number b(v). Also, O(b(v)) and Z(b(v)) will denote the number of ones and
zeros, respectively, in the binary number b(v).


3. THE TWISTED N-CUBE 

Let C be any shortest cycle (i.e., a cycle of four vertices) in $Q_n$.
Also, let ux and vy be any two independent edges in C. The
twisted n-cube graph $TQ_n$ is then constructed as follows. Delete edges ux
and vy from $Q_n$. Then, connect, via an edge, vertex u to vertex y, and
vertex v to vertex x. That is, $TQ_n = Q_n - \{ux, vy\} + \{uy, vx\}$.
Figure 1 shows $Q_3$ and $TQ_3$. Note that by construction, $TQ_n$ is
n-regular just as $Q_n$ is. Also, observe that $TQ_n$ has two disjoint
$Q_{n-1}$ as subgraphs.

Although the cube can be twisted around any 4-cycle, we will usually use
the canonically twisted $Q_n$ where vertices u, v, x, and y have the labels
$b(u) = 000 \cdots 0$, $b(v) = 010 \cdots 0$, $b(x) = 100 \cdots 0$, and
$b(y) = 110 \cdots 0$. In the subsequent sections we describe some of the
properties of the twisted cube $TQ_n$.
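The construction of the canonically twisted cube can be sketched as follows. This is a minimal illustration rather than an implementation taken from the paper: the name `twisted_cube` and the dictionary-of-sets representation are assumptions, and labels are stored as integers so that u, v, x, y are 0, 2^(n-2), 2^(n-1), and 2^(n-1) + 2^(n-2).

```python
def twisted_cube(n):
    """Adjacency sets of the canonically twisted n-cube (n >= 3)."""
    # Start from the ordinary n-cube: neighbors differ in exactly one bit.
    adj = {a: {a ^ (1 << i) for i in range(n)} for a in range(1 << n)}
    u, v = 0, 1 << (n - 2)            # labels 000...0 and 010...0
    x, y = 1 << (n - 1), 3 << (n - 2)  # labels 100...0 and 110...0
    for a, b in ((u, x), (v, y)):      # delete edges ux and vy
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((u, y), (v, x)):      # add edges uy and vx
        adj[a].add(b)
        adj[b].add(a)
    return adj
```

Every vertex loses at most one edge and gains one in exchange, so the degree of each vertex stays n, which matches the n-regularity noted above.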


4. DIAMETER OF $TQ_n$


It is well-known that $d(Q_n) = n$. Also, between any pair of vertices u
and v in $Q_n$ there are n disjoint paths, of which d(u,v) are of length d(u,v)
and the rest are of length d(u,v)+2 [Kuhl80, SaSc85]. As a result, if
$d(u,v) \le n-1$ then there are at least $n-2$ disjoint paths between u and v,
each of which is of length at most $n-1$. This property of $Q_n$ will be used
shortly.

Theorem 1: $d(TQ_n) = n-1$, for $n \ge 3$.

Proof: Let $TQ_n$ be the canonically twisted cube. The theorem can easily
be verified when n equals 3 and 4. Thus assume that $n \ge 5$. Now, let s and
t be any two vertices in $TQ_n$. We will show that in $TQ_n$ we have
$d(s,t) \le n-1$ for all s, t with equality for at least one pair. Depending on
the value of d(s,t) in $Q_n$, the following two cases are considered.

Case 1: In $Q_n$ we have $d(s,t) \le n-1$. Then there are at least
$n-2 \ge 3$ disjoint paths between s and t in $Q_n$, each of which is of
length at most $n-1$. Thus, removal of edges ux and vy from $Q_n$ can destroy
at most two of such paths. This implies that in $TQ_n$ we have
$d(s,t) \le n-1$.
Case 2: In $Q_n$ we have $d(s,t) = n$. Let
$b(s) = (b_{n-1} b_{n-2} b_{n-3} \cdots b_1 b_0)$ so that
$b(t) = (\bar{b}_{n-1} \bar{b}_{n-2} \bar{b}_{n-3} \cdots \bar{b}_1 \bar{b}_0)$,
where $\bar{b}_i$ is the binary complement of $b_i$. A shortest s-t path in
$TQ_n$ can be constructed as follows.

First concentrate on the ones of b(s) in positions n-3, n-4, ..., 0.
We can change these ones to zeros by traveling over a single edge for each
exchange. Thus, after traveling $O(b_{n-3} b_{n-4} \cdots b_0)$ edges we will
arrive at one of the vertices u, v, x, or y (which one is determined, of
course, by the two leading bits $b_{n-1} b_{n-2}$ of s). Next, we can change
$b_{n-1} b_{n-2}$ to $\bar{b}_{n-1} \bar{b}_{n-2}$ by using a single edge of
$TQ_n$. That edge will be uy or vx depending upon which of the four vertices
we were led to by the first part of the path. Finally, all the zeros in
$(b_{n-3} b_{n-4} \cdots b_0)$ must be turned to ones. Again a single edge is
used for each of the $Z(b_{n-3} b_{n-4} \cdots b_0)$ bits involved. Hence
the total number of edges in our s-t path is

$$O(b_{n-3} b_{n-4} \cdots b_0) + Z(b_{n-3} b_{n-4} \cdots b_0) + 1 = (n-2) + 1 = n-1.$$

It is easy to see that there is no shorter s-t path: traveling over any
edge of $TQ_n$ changes only one bit with the exception of uy and vx, which
change two. It can be easily seen that edges uy and vx cannot both appear
in any shortest path. Since b(s) and b(t) differ in all n positions, at least
$n-1$ edges are needed to transform all bits. It follows that $d(s,t) = n-1$
by the construction above. Combining this fact with Case 1, we see that
$d(TQ_n) = n-1$ as desired. □
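Theorem 1 and the Case 2 construction can be checked numerically for small n. The sketch below is an illustration under stated assumptions, not code from the paper: `twisted_cube`, `diameter`, and `antipodal_path` are hypothetical names, labels are integers, and the diameter is found by breadth-first search from every vertex.

```python
from collections import deque


def twisted_cube(n):
    """Adjacency sets of the canonically twisted n-cube (n >= 3)."""
    adj = {a: {a ^ (1 << i) for i in range(n)} for a in range(1 << n)}
    u, v = 0, 1 << (n - 2)
    x, y = 1 << (n - 1), 3 << (n - 2)
    for a, b in ((u, x), (v, y)):      # delete ux and vy
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((u, y), (v, x)):      # add uy and vx
        adj[a].add(b)
        adj[b].add(a)
    return adj


def diameter(adj):
    """Largest BFS distance over all vertex pairs of a connected graph."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        best = max(best, max(dist.values()))
    return best


def antipodal_path(s, n):
    """Case 2 path from s to its bitwise complement, using n-1 edges."""
    path, cur = [s], s
    for i in range(n - 2):             # clear the ones in positions n-3..0
        if cur & (1 << i):
            cur ^= 1 << i
            path.append(cur)
    cur ^= 3 << (n - 2)                # twisted edge uy or vx flips both leading bits
    path.append(cur)
    for i in range(n - 2):             # set the bits that were zero in s
        if not s & (1 << i):
            cur ^= 1 << i
            path.append(cur)
    return path
```

For example, `diameter(twisted_cube(5))` can be compared against n - 1 = 4, and `antipodal_path(0b10110, 5)` lists the vertices visited on the way to `0b01001`.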


5. VERTEX-CONNECTIVITY OF $TQ_n$

It is known that $\kappa(Q_n) = n$ [ArGr81, Kohl80]. We next prove that
$\kappa(TQ_n) = n$. In fact we prove a more general connectivity theorem. Let
$G_1$ and $G_2$ be two connected graphs with the same number p of vertices.
Furthermore, let $V(G_1) = \{u_1, u_2, \ldots, u_p\}$ and
$V(G_2) = \{v_1, v_2, \ldots, v_p\}$. Then $H = G_1 \odot G_2$ represents the
graph obtained by taking $G_1$ and $G_2$ and connecting, via a new edge,
vertex $u_i$ to vertex $v_i$, for $1 \le i \le p$. That is,

$$V(H) = V(G_1) \cup V(G_2)$$

and

$$E(H) = E(G_1) \cup E(G_2) \cup \{u_i v_i \mid u_i \in V(G_1),\ v_i \in V(G_2),\ 1 \le i \le p\}.$$

The $u_i v_i$ edges will be referred to as cross edges. Note that operation
$\odot$ may generate different H graphs depending on how the vertices in
graphs $G_1$ and $G_2$ are labeled [Hede69].
Theorem 2: Let $G_1$ and $G_2$ be connected graphs defined as above, and let
$H = G_1 \odot G_2$. Then $\kappa(H) \ge 1 + \min(\kappa(G_1), \kappa(G_2))$.

Proof: Let $k = \min(\kappa(G_1), \kappa(G_2))$, and let X be an arbitrary
subset of V(H) such that |X| = k. We prove the theorem by showing that
$H - X$ is connected. Observe that H contains at least k + 1 cross edges since
k must be smaller than the number of vertices in each of the graphs $G_1$ and
$G_2$. Therefore, removal of k vertices from H cannot cause deletion of all
cross edges. Now if $X \cap V(G_1) = \emptyset$ (respectively
$X \cap V(G_2) = \emptyset$) then $G_1$ (respectively $G_2$) is a connected
subgraph of $H - X$. Furthermore, every remaining vertex of $G_2$
(respectively $G_1$) is connected to this connected subgraph. Hence $H - X$
is connected.

Now suppose $X \cap V(G_1) = X_1 \ne \emptyset$ and
$X \cap V(G_2) = X_2 \ne \emptyset$. We must then have
$1 \le |X_1| \le k-1$ and $1 \le |X_2| \le k-1$. This implies that both
$G_1 - X_1$ and $G_2 - X_2$ are connected by definition of k. Since there is
at least one cross edge, say e, in $H - X$, the end points of e lie in
$G_1 - X_1$ and $G_2 - X_2$, and therefore $H - X$ must be connected. □

Theorem 3: $\kappa(TQ_n) = n$.

Proof: Clearly it is possible to take two copies of $Q_{n-1}$ and label their
vertices such that $TQ_n = Q_{n-1} \odot Q_{n-1}$. Since
$\kappa(Q_{n-1}) = n-1$, Theorem 2 implies that $\kappa(TQ_n) \ge n$. Also for
any n-regular graph G, $\kappa(G) \le n$, hence the desired result. □
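For small n the vertex-connectivity can also be confirmed by exhaustive search, directly from the definition given in Section 2. The following sketch is an assumption-laden illustration: the names `twisted_cube`, `is_connected`, and `vertex_connectivity` are hypothetical, and the brute-force search is only practical for very small graphs.

```python
from itertools import combinations


def twisted_cube(n):
    """Adjacency sets of the canonically twisted n-cube (n >= 3)."""
    adj = {a: {a ^ (1 << i) for i in range(n)} for a in range(1 << n)}
    u, v = 0, 1 << (n - 2)
    x, y = 1 << (n - 1), 3 << (n - 2)
    for a, b in ((u, x), (v, y)):
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((u, y), (v, x)):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def is_connected(adj, removed):
    """True if the graph minus the vertex set `removed` is connected."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        a = stack.pop()
        for b in adj[a]:
            if b not in removed and b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen) == len(alive)


def vertex_connectivity(adj):
    """Least |X| such that G - X is disconnected or a single vertex."""
    total = len(adj)
    for k in range(total):
        if total - k <= 1:
            return k
        for cut in combinations(adj, k):
            if not is_connected(adj, set(cut)):
                return k
    return total - 1
```

Calling `vertex_connectivity(twisted_cube(4))` tries every set of at most three vertices before finding a separating set of size four, such as the four neighbors of a single vertex.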


6. LENGTHS OF DISJOINT PATHS IN $TQ_n$

It is well-known that if G is a graph with $\kappa(G) = n$, then given any
two distinct vertices $s, t \in V(G)$ we can find n disjoint s-t paths in G
[Hara72]. The following theorem gives an explicit description of such
paths in $TQ_n$. Its proof, which is long, is omitted due to space
constraints, but can be found in [EsNS88].


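[Table: number of disjoint s-t paths of length d-1, d, d+1, d+2, and d+3 for
each exception listed below. The numeric entries, and the complement bars
that distinguish the conditions on the leading bits, are not recoverable.]

Exception

1. There is one fixed 1 in suf(s) and
   (a) b_{n-1} b_{n-2} = c_{n-1} c_{n-2}; or
   (b) either s, t are adjacent to u, x or
       s, t are adjacent to v, y

2. There is no fixed 1 in suf(s) and
   (a) b_{n-1} b_{n-2} = c_{n-1} c_{n-2}; or
   (b) b_{n-1} b_{n-2} = c_{n-1} c_{n-2}
       with exactly one of s, t equal to u, v, x, y; or
   (c) b_{n-1} b_{n-2} = c_{n-1} c_{n-2}
       with exactly one of s, t equal to u, v, x, y; or
   (d) b_{n-1} b_{n-2} = c_{n-1} c_{n-2}
       with s or t equal to u, v, x, y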

Theorem 6: Let $TQ_n$ be the canonically twisted n-cube and consider
$s, t \in V(TQ_n)$ with $b(s) = b_{n-1} b_{n-2} \cdots b_0$ and
$b(t) = c_{n-1} c_{n-2} \cdots c_0$. If d(s,t) = d in $Q_n$ then a set of n
disjoint paths consisting of d of length d and n-d of length d+2 continues to
exist in $TQ_n$ with the exception of the cases noted in the following table.
If the entry in row i and column d+j is k this means that there are k disjoint
s-t paths of length d+j for exception i. A blank indicates no such paths. All
paths for a given row can be taken to be disjoint. □


Note that in all exceptional cases but two (specifically 1(b) and 2(a))
the average length of the n paths in the table is at least as short as the
average length between the same two points in $Q_n$. In fact for some of the
cases above the paths from $Q_n$ still exist in $TQ_n$, but the listed set of
paths will be shorter.


7. SOME SUBGRAPHS OF $TQ_n$

In this section, we will identify some of the subgraphs of $TQ_n$. By
construction, $Q_{n-1}$ is a subgraph of $TQ_n$ and thus all its subgraphs are
contained in $TQ_n$. In fact any subgraph of $Q_n$ which does not contain two
independent edges belonging to some 4-cycle of $Q_n$ is also contained in
$TQ_n$. This implies that $TQ_n$ contains a $2^n$-cycle, and any
2-dimensional mesh which is a subgraph of $Q_n$. While $Q_n$ contains only
even cycles, $TQ_n$ contains odd cycles as well.
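The remark about odd cycles is equivalent to saying that the n-cube is bipartite while the twisted cube is not, and this can be checked with a two-coloring. The sketch below is illustrative only; `n_cube`, `twist`, and `is_bipartite` are hypothetical names, and the canonical twist on integer labels is assumed.

```python
from collections import deque


def n_cube(n):
    """Adjacency sets of the ordinary n-cube."""
    return {a: {a ^ (1 << i) for i in range(n)} for a in range(1 << n)}


def twist(adj, n):
    """Apply the canonical twist in place and return the same dictionary."""
    u, v = 0, 1 << (n - 2)
    x, y = 1 << (n - 1), 3 << (n - 2)
    for a, b in ((u, x), (v, y)):
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((u, y), (v, x)):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def is_bipartite(adj):
    """Two-color the graph by BFS; an odd cycle forces a color clash."""
    color = {}
    for s in adj:
        if s in color:
            continue
        color[s] = 0
        queue = deque([s])
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if b not in color:
                    color[b] = 1 - color[a]
                    queue.append(b)
                elif color[b] == color[a]:
                    return False
    return True
```

In the twisted 3-cube, the vertices 000, 110, 111, 011, 001 (in that order) form one cycle of length 5: the first step uses the new edge uy, and every other step flips a single bit.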

In what follows we will show that the complete binary tree on $2^n - 1$
vertices, $T_n$, is a subgraph of $TQ_n$. It is known that $T_n$ is not a
subgraph of $Q_n$ [SaSc85]. However, $T_{n-1}$ is contained in $Q_n$
[BrSc85]. To present our result, we need to show that two disjoint copies of
$T_{n-1}$ can be found in $Q_n$. This was first demonstrated by Prada,
rediscovered independently by Bhatt and Ipsen, and then re-rediscovered by us
[Prah74, BhIp85]. We include the proof for the sake of completeness.

Let $S_n$ denote the graph obtained by taking two disjoint complete
binary trees $T_{n-1}$ and connecting their roots by a path of length 3. A
picture of $S_4$ is given in Figure 2.

Theorem 7: For $n \ge 2$, $S_n$ is a subgraph of $Q_n$. Furthermore, for
$n \ge 3$ the roots r and u of the two copies of $T_{n-1}$ can be labeled so
that b(r) and b(u) differ in exactly three positions.

Proof: Figure 3 gives labelings which embed $S_n$ in $Q_n$ for n = 2, 3. Note
that in the latter case the labels along the path r-s-t-u of length 3 are
001, 011, 111, and 110 respectively. By induction we may assume that
$S_{n-1}$ is isomorphic to a subgraph of $Q_{n-1}$ with

$b(r) = 001 \cdots 1$
$b(s) = 011 \cdots 1$
$b(t) = 111 \cdots 1$
$b(u) = 11 \cdots 10$.

Now we can find two disjoint subgraphs isomorphic to $S_{n-1}$, call them
$S^0_{n-1}$ and $S^1_{n-1}$, in $Q_n$ as follows. $S^0_{n-1}$ is obtained by
prefixing every label of $S_{n-1}$ with a 0. Thus the labels of the
corresponding path of length 3 are

$b(r^0) = 0001 \cdots 1$
$b(s^0) = 0011 \cdots 1$
$b(t^0) = 0111 \cdots 1$
$b(u^0) = 011 \cdots 10$.

If $v \in S_{n-1}$ is labeled $b(v) = (b_{n-2} b_{n-3} \cdots b_0)$ in
$Q_{n-1}$ then in $S^1_{n-1}$ we let $b(v^1) = (1 b_0 b_1 \cdots b_{n-2})$.
In particular,

$b(r^1) = 11 \cdots 100$
$b(s^1) = 11 \cdots 110$
$b(t^1) = 11 \cdots 111$
$b(u^1) = 101 \cdots 11$.

A schematic drawing of these subgraphs is displayed in Figure 4(a). Now,
the graph $S_n$ is created by letting

$$S_n = S^0_{n-1} + S^1_{n-1} + \{s^0 u^1,\ t^0 t^1,\ u^0 s^1\} - \{t^0 u^0,\ t^1 u^1\}$$

as in Figure 4(b). Finally the new roots are $s^0$ and $s^1$ with labels
$001 \cdots 1$ and $11 \cdots 10$ respectively, which differ in exactly three
positions. □


Theorem 8: $T_n$ is a subgraph of $TQ_n$.

Proof: Find a subgraph of $Q_n$ which is isomorphic to $S_n$ with the path
r-s-t-u labeled as in Theorem 7, that is

$b(r) = 001 \cdots 1$
$b(s) = 011 \cdots 1$
$b(t) = 111 \cdots 1$
$b(u) = 11 \cdots 10$.

If $v \in V(Q_n)$ has label $b(v) = 101 \cdots 1$ then we can construct

$$TQ_n = Q_n - \{rs, tv\} + \{rt, sv\}.$$

Clearly $T_n = S_n + \{rt\} - \{s\}$ is a subgraph of $TQ_n$ (note that
$tv \notin E(S_n)$ so that the removal of this edge from $Q_n$ causes no
difficulties). □
Figure 3: Embedding $S_n$ in $Q_n$ for n = 2, 3


(a) Two copies of $S_{n-1}$


(b) Constructing $S_n$ using two copies of $S_{n-1}$


Figure 4.


8. CONCLUDING REMARKS 

The hypercube interconnection topology, due to its powerful topological
properties, has been widely adopted in the construction of
distributed-memory multiprocessors. In this paper, we have shown that by
exchanging any two independent edges in any shortest cycle of the hypercube,
an interconnection topology, namely $TQ_n$, can be achieved which
has some nice properties. Existing hypercube multiprocessors can be
modified to take advantage of this new topology in two ways. First, a
hypercube can be converted to $TQ_n$ by exchanging two of its physical links.
Second, two extra physical links can be added to a hypercube multiprocessor to
obtain a topology which has both $Q_n$ and $TQ_n$ as subgraphs. In both cases,
other components of the system should be modified accordingly. One
major component is the router at each processing node. In what follows we
address this issue for both cases.

Each processor (vertex) in the hypercube multiprocessor has a router
to handle the interprocessor communication [LaNE87]. The function of the
router may be performed by the processor or by a dedicated router chip. In
a hypercube multiprocessor, upon receiving a message, a routing tag
$(r_{n-1} r_{n-2} r_{n-3} \cdots r_0)$ is obtained by taking a bit-wise
exclusive-OR operation between the router's local address
$(c_{n-1} c_{n-2} \cdots c_0)$ and the destination address
$(d_{n-1} d_{n-2} \cdots d_0)$ of the message. The message can then be
forwarded to one of the neighboring processors through the j-th link if
$r_j = 1$ for $0 \le j \le n-1$.
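The routing-tag computation for the ordinary hypercube can be sketched as below. This is a minimal illustration, not the router logic of any particular machine: `route` is a hypothetical name, addresses are integers, and choosing the lowest-numbered link with $r_j = 1$ is an assumption made here, since the text allows any such link.

```python
def route(src, dst):
    """Vertices visited when routing from src to dst in the n-cube."""
    path, cur = [src], src
    while cur != dst:
        tag = cur ^ dst                      # routing tag: XOR of local and destination address
        j = (tag & -tag).bit_length() - 1    # assumed policy: lowest j with r_j = 1
        cur ^= 1 << j                        # forward through the j-th link
        path.append(cur)
    return path
```

Each hop clears one bit of the tag, so the number of routing steps equals the number of ones in the initial tag; for instance `route(0b000, 0b101)` visits 000, 001, 101.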

To support the $TQ_n$ topology, the function of the routers should be
slightly modified. For these routers, the routing tag is computed as above.
Suppose $TQ_n$ is the canonically twisted n-cube. Let us first consider the
four routers at vertices u, v, x, and y; we will refer to these routers as
twisted routers. If $r_{n-1} r_{n-2} = 01$ then the message is forwarded
through the (n-2)-nd link, that is, either uv or xy. If
$r_{n-1} r_{n-2} = 11$ then the message is forwarded through the (n-1)-st
link, that is, either uy or vx. Note that in this case one routing step is
saved compared with that in $Q_n$. If $r_{n-1} r_{n-2} = 10$ then the message
is forwarded through the (n-2)-nd link if $r_j = 0$ for all
$0 \le j \le n-3$; otherwise, the message is forwarded through some j-th link
with $r_j = 1$ where $0 \le j \le n-3$. Note that in the former case,
it will take two routing steps rather than one as required in $Q_n$. However,
this additional routing step may not be necessary if the message is
forwarded through other links first as in the latter case.

The function of the remaining $2^n - 4$ routers will also have to be
slightly modified in order to take advantage of a possible saving of one
routing step. If $r_{n-1} r_{n-2} = 11$ then one routing step can be saved by
first forwarding the message to the node $d_{n-1} d_{n-2} 0 0 \cdots 0$, one
of the four twisted routers, and then the message is forwarded to the final
destination. If $r_{n-1} r_{n-2} = 10$, then the message has to be forwarded
through the (n-1)-th link if there exists only one j ($0 \le j \le n-3$) such
that $r_j = 1$. This is to avoid having an additional routing step. For all
other cases, the message can be forwarded to any j-th link so long as
$r_j = 1$.

For the case where two edges (i.e., uy and vx) are added to $Q_n$, the
routers are modified as follows. For the four twisted routers, now each with
n+1 links, if $r_{n-1} r_{n-2} = 11$ then the message should be forwarded
through the added link. Thus, one routing step is saved. For all other cases,
the normal routing procedure should be followed. For the remaining $2^n - 4$
routers, if $r_{n-1} r_{n-2} = 11$, then one routing step can be saved by
first forwarding the message to the node $d_{n-1} d_{n-2} 0 0 \cdots 0$, one
of the four twisted routers, and then the message is forwarded to the final
destination.

In summary, the twisted n-cube, $TQ_n$, shares the following properties with
the n-cube $Q_n$. $TQ_n$ consists of two disjoint $Q_{n-1}$ subgraphs. Even
rings and 2-dimensional meshes are subgraphs of $TQ_n$. $TQ_n$ is n-regular
and its vertex connectivity remains n. In addition, $TQ_n$ has the following
unique properties not possessed by $Q_n$. Any odd length ring with $2^n - 1$
or fewer vertices is contained in $TQ_n$. A complete binary tree with
$2^n - 1$ vertices, which is a topology in high demand by many applications,
is a subgraph of $TQ_n$. The worst case number of routing steps is reduced
from n to n-1. Furthermore, the average number of routing steps is also
reduced. This implies an improvement in communication delay, which is
critical to system performance.
RELIABILITY OF THE HYPERCUBE
Urbana, Illinois 61801


ABSTRACT 

Several analytical models are presented and solved in this paper for the 
subcube reliability problem associated with the hypercube multiprocessor 
architecture. This problem refers to the ability of a binary d-cube, in the 
presence of component failures, to embed disjoint functional subcubes of 
various sizes in the damaged structure. Partitioning of a fault-free 
hypercube into subcubes that can be allocated to different tasks is a 
common practice, so that the degradation in this ability is a good meas- 
ure of the effect of failures. We provide models that account for node 
failures only, link failures only, or both node and link failures. These 
show that the architecture is quite resilient to failures in terms of the 
ability to salvage functional subcubes out of a damaged hypercube. 


1. INTRODUCTION 

The boolean d-cube network [7,9,10], consisting of processing elements
placed at the vertices of a d-dimensional hypercube, has proved to be a
popular structure for direct-connected multiprocessor systems. The
reader is referred to the various papers in [5] for the basis of this
popularity. In this paper, we are concerned with a particular aspect of the
hypercube architecture, viz., its reliability. A hypercube system has $2^d$
nodes (processing elements) and $d 2^{d-1}$ (full duplex) connecting links. In
large systems of this kind, it is obvious that some components of the
system will fail before long, so that characterization of the degraded
system is important to determine how many of the failures can be tolerated.
In the next three sections of this paper, we provide analytical models for
a particular version of this reliability problem. The rest of this section
will be devoted to terminology and problem definition.

Let us begin with a brief summary of the structural and topological
properties of the hypercube that are relevant to our analyses. A complete
discussion of this subject can be found in [2,8]. The $N = 2^d$ nodes of the
d-dimensional hypercube can be labeled using d-bit addresses and the
connections between them specified as follows: two nodes whose
addresses differ in exactly one bit position i, $0 \le i \le d-1$, are
connected by a link. This link is said to span dimension i of the cube, so
that each of the d dimensions has N/2 links spanning it. We refer to d as the
order of the hypercube. Each node in a d-cube has degree d and the
distance between two nodes x and y, whose addresses differ in j bit
positions, is given by the Hamming distance H(x,y) = j. Fig. 1 shows the
structure of a binary 4-cube.
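The counts and the distance measure in this paragraph can be written down directly. The helper names below (`hamming`, `num_nodes`, `num_links`, `links_spanning`) are hypothetical and serve only as a sketch, with node addresses assumed to be stored as integers.

```python
def hamming(x, y):
    """Hamming distance H(x, y): number of bit positions where x and y differ."""
    return bin(x ^ y).count("1")


def num_nodes(d):
    """Number of nodes N = 2^d in a d-cube."""
    return 1 << d


def num_links(d):
    """Number of links d * 2^(d-1) in a d-cube."""
    return d * (1 << (d - 1))


def links_spanning(d, i):
    """Links spanning dimension i, each listed as (address with bit i = 0, with bit i = 1)."""
    return [(a, a | (1 << i)) for a in range(1 << d) if not a & (1 << i)]
```

For the binary 4-cube this gives 16 nodes, 32 links, and 8 links spanning each of the 4 dimensions.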

A j-subcube of a d-cube is a subgraph consisting of $2^j$ nodes (and the
connecting links between them) obtained by choosing d-j dimensions
$i_1, i_2, \ldots, i_{d-j}$, and considering all the nodes that have the same
address bit in each of these bit positions. Fig. 1 illustrates two 3-subcubes
(highlighted) in a binary 4-cube. Such a j-subcube can be thought of as
being generated by the following process: split the d-cube across
dimension $i_1$, separating it into two (d-1)-subcubes, each consisting of
nodes containing the same bit in position $i_1$; choose one of these cubes and
split it across dimension $i_2$, resulting in two (d-2)-subcubes, etc.,
continuing until a (j+1)-subcube is split across dimension $i_{d-j}$ to result
in two j-subcubes.
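The definition of a j-subcube translates into a short enumeration. In this sketch, `subcube_nodes` is a hypothetical helper and the `fixed` dictionary (mapping each chosen dimension to the common address bit) is an assumed representation of the choice of dimensions $i_1, \ldots, i_{d-j}$.

```python
from itertools import product


def subcube_nodes(d, fixed):
    """Addresses of the subcube of a d-cube with the bits in `fixed` held constant.

    `fixed` maps a dimension index to the bit (0 or 1) shared by all nodes.
    """
    free = [i for i in range(d) if i not in fixed]
    base = sum(bit << i for i, bit in fixed.items())
    nodes = []
    for bits in product((0, 1), repeat=len(free)):
        node = base
        for i, b in zip(free, bits):
            node |= b << i
        nodes.append(node)
    return sorted(nodes)
```

Fixing d-j dimensions leaves j free bits, so the result always has $2^j$ addresses; for example, `subcube_nodes(4, {3: 0})` and `subcube_nodes(4, {3: 1})` are the two disjoint 3-subcubes obtained by splitting a 4-cube across dimension 3.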

This recursive construction of the hypercube from smaller subcubes 
proves very useful in task allocation and partitioning the cube for appli- 
cations. The subcubes have all the structural properties of the larger 
cube so that many of the algorithms designed for hypercubes may be 
written with the order of the available cube as a runtime parameter. The 
AXIS operating system for the NCUBE multiprocessor, for instance, per- 
mits the main cube array to be shared among two or more tasks, allocat- 
ing the subcube of the appropriate size to each task [4]. Because the 
subcubes are disjoint from each other, allocation of the partitions is par- 
ticularly simplified and each task considers itself as working on an i-cube 
(with nodes relabeled accordingly). It is possible to view an incoming 
task as a set of interacting modules that have to be assigned to the nodes
of a subcube with adjacencies between modules in the task graph being 
preserved in the subcube; algorithms have been developed to determine 
the size of the subcube required for each task under this condition [3]. 
In addition, efficient algorithms for many applications are designed to 
exploit the subcube partitioning ability of the hypercube, quite often in a 
recursive or divide-and-conquer fashion [6,11]. It is useful to see how 
much of this ability is lost when failures begin to occur in the system. 

The problem that we address in this paper can be expressed as
follows. We seek expressions for the reliability and mean time to failure
of a d-cube system for a variety of breakdown conditions. These condi- 
tions will be defined in terms of the ability to embed disjoint subcubes of 
different sizes in a hypercube. In all but one of the cases that we con- 
sider, we analyze conditions in which at least a (d—1)-subcube, the larg- 
est proper subcube, is functional. Additional disjoint subcubes of smaller 
sizes could coexist, and these will be captured in the definitions of vari- 
ous system states. Section 2 considers this problem under the node 
failure model, in which only the effect of node failures will be con- 
sidered. Section 3 develops a similar model for the link failure case, and 
Section 4 analyzes the system when both link and node failures are per- 
mitted in the model. In Section 5, we present schemes for remapping 
the node addresses so that functional subcubes can be salvaged from the 
damaged hypercube system. 

For an alternate formulation of the reliability problem for the hyper- 
cube, and some issues related to its fault tolerance, see [1]. 


2. NODE FAILURE MODEL 

Consider first the embedding of a functional (d—1)-subcube in the 
presence of failures. While it is true that a single node failure must 
always leave an undamaged (d—1)-subcube, as few as two failures could 
destroy all such subcubes. For example, if node 0 and node N — 1 fail, 
there is no way to embed an undamaged (d—1)-subcube in the damaged 
d-cube. Given an arbitrary set of node failures, a fault free (d-1)- 
subcube exists if and only if all the faulty nodes may be contained in an 
i-subcube, i < d. (Necessity follows from the example presented above; 
sufficiency becomes obvious when considering that a d-cube may be 
divided into two disjoint (d—1)-subcubes and all faulty nodes positioned 
totally within one of the subcubes.) 
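
The containment condition can be checked directly from the fault list. The sketch below is illustrative (the function names are not from the paper): a dimension has to remain free in the enclosing subcube exactly when two faulty addresses differ in that bit, so the order of the smallest enclosing subcube is the number of such bits.

```python
def enclosing_subcube_order(faulty: list[int]) -> int:
    """Order i of the smallest subcube that contains every faulty node."""
    if not faulty:
        raise ValueError("no faulty nodes")
    differing = 0
    for x in faulty:
        # Accumulate every bit position in which some faulty node
        # disagrees with the first one.
        differing |= x ^ faulty[0]
    return bin(differing).count("1")


def has_fault_free_d_minus_1_subcube(faulty: list[int], d: int) -> bool:
    """True when some (d-1)-subcube of the d-cube contains no faulty node."""
    return not faulty or enclosing_subcube_order(faulty) < d
```

With d = 4, the fault set {0, 15} yields an enclosing order of 4, so no fault-free 3-subcube survives, as in the node 0 and node N-1 example above.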

We define $S_i$ to be the system state in which all node failures in the
cube are contained in an i-subcube, but not in an (i-1)-subcube, for
$0 \le i \le d$. In terms of functional subcubes, state $S_i$ can be characterized
as embedding exactly $d-i$ disjoint subcubes of order $d-1, d-2, \ldots, i$,
respectively. (To see this, split the d-cube into two (d-1)-cubes with
one of these cubes containing the faulty i-subcube; split the latter
(d-1)-cube into two (d-2)-cubes, with one of these cubes containing the faulty
i-subcube, etc.) This sequence of functional subcube sizes is unique if
we insist on a maximal disjoint subcube at each point in the sequence.
(Note that while the size sequence is unique, there may be many ways to
generate a subcube of a particular size.) Also worth noting is the fact
that even though no additional disjoint subcube of order $\ge i$ may be
embedded, it might be possible to embed functional subcubes of order
$\le i-1$ inside the faulty i-subcube. In state $S_d$, no embedding of a
functional (d-1)-subcube is possible. Finally, $S^*$ represents the initial,
fault-free state of the d-cube. We assume that all nodes have an identical
exponential failure distribution with constant failure rate $\lambda$. The
transitions between these states are shown in Fig. 2.

The transition from state $S_i$ to state $S_{i+j}$ ($0 < j \le d-i$) occurs when an
additional fault has occurred outside the damaged i-subcube and the new
damaged cube will be of size $i+j$. To determine the rate, imagine the
d-cube as split across i dimensions such that each node in the original
damaged i-cube is in a separate partition. Each of the resultant $2^i$
partitions is a (d-i)-cube containing exactly one node from the damaged
i-cube. The new failure must be in one of these partitions, say C; further,
all paths within region C cross none of the i dimensions used during the
original splitting process. Therefore, if the new failure in C is distance j
away from the node in C that belongs to the damaged i-cube, all
damaged nodes in the system may be contained in an (i+j)-subcube. There
are $\binom{d-i}{j}$ such nodes in each C, leading to a transition rate of
$\lambda 2^i \binom{d-i}{j}$.

Let us denote the probability of being in state $S_i$ as a function of time
as $P_i(t)$. Then the state equations for this system are given by
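
$$\frac{dP^*(t)}{dt} = -\lambda 2^d P^*(t), \qquad \frac{dP_0(t)}{dt} = \lambda 2^d P^*(t) - \lambda (2^d - 1) P_0(t),$$

$$\frac{dP_i(t)}{dt} = \lambda \sum_{k=0}^{i-1} 2^k \binom{d-k}{i-k} P_k(t) - \lambda (2^d - 2^i) P_i(t), \quad 0 < i \le d.$$
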
The initial conditions are $P^*(0) = 1$, and $P_i(0) = 0$ for all i. It can be
shown (by induction on i) that the solution to this system of equations
can be written as:

Let us now define the cumulative probabilities $R_i(t)$, $0 \le i < d$, as

$R^*(t) = P^*(t)$, $R_0(t) = P_0(t) + R^*(t)$, $R_i(t) = P_i(t) + R_{i-1}(t)$.


Thus, $R_i(t)$ is the probability that all node failures (if any) up to time t
are contained within an i-subcube, leaving $d-i$ functional disjoint
subcubes of order $d-1, d-2, \ldots, i$. These functions are plotted in Fig. 3 for a
10-cube. (A conservative node failure rate of $\lambda = 10^{-5}$ per hour is
assumed.) $R_i(t)$ represents the reliability of the system if our system
breakdown condition is based on a fault-containment criterion, viz., the
system fails once the node failures are no longer contained in an i-subcube.
The system's mean time to failure ($T_i$) can be evaluated under this
criterion by integrating the expression for $R_i(t)$. It is easy to see that for
$0 \le i < d$,

These are evaluated in Table 1 for various system sizes.

All the numbers in Table 1 would scale linearly with $1/\lambda$. Let us
interpret these numbers using the 10-cube as an example. The mean time to
the first node failure is 98 hours. If the system can stay operational with
disjoint subcubes of order 9, 8, 7, 6, 5, and 4 (i.e., all the failures are
confined to a 4-cube), then the MTTF increases to about 10 days. Note
that much of this increase materializes from just the ability to tolerate
one node failure: $T_0 = 195$ hours. However, as we relax our fault
containment criterion beyond a 5-cube (a (d/2)-cube in general), the
increases in MTTF are much more substantial. For very large cubes
($d \ge 12$), insistence on a functional (d-1)-subcube is indeed a stiff
condition to satisfy, as the MTTF numbers testify.
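
The mean times quoted above can also be obtained without a closed-form solution, by computing expected sojourn times directly from the transition rates $\lambda 2^k \binom{d-k}{j}$ of Section 2. The following is a sketch under that assumption, not the evaluation procedure used for Table 1; the function name `node_model_mttf` is hypothetical, and $\lambda = 10^{-5}$ per hour is taken as the node failure rate because it reproduces the 98-hour mean time to first failure for a 10-cube.

```python
from math import comb


def node_model_mttf(d: int, lam: float) -> list[float]:
    """Return [T_0, ..., T_{d-1}] for the node failure model.

    T_i is the expected time, starting from the fault-free state, until the
    failed nodes no longer fit inside an i-subcube.
    """
    first_failure = 1.0 / (lam * 2**d)       # fault-free state -> S_0 at rate lam * 2^d
    results = []
    for i in range(d):
        # tau[k]: expected remaining time spent in S_k, ..., S_i starting in S_k.
        tau = [0.0] * (i + 1)
        for k in range(i, -1, -1):
            exit_rate = lam * (2**d - 2**k)  # any failure outside the damaged k-subcube
            t = 1.0 / exit_rate
            for j in range(1, i - k + 1):    # S_k -> S_{k+j}, still within tolerance
                rate = lam * 2**k * comb(d - k, j)
                t += (rate / exit_rate) * tau[k + j]
            tau[k] = t
        results.append(first_failure + tau[0])
    return results
```

Calling `node_model_mttf(10, 1e-5)[0]` gives roughly 195 hours, matching the value of $T_0$ quoted above; because every term is proportional to $1/\lambda$, the same holds for each $T_i$.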


3. LINK FAILURE MODEL 

We now consider the state of the damaged cube under a link failure 
model. As in the last section, we are interested only in system 
configurations that can embed at least a (d—1)-subcube. Link failures 
affect the topology of the cube in a fundamentally different way than 
node failures do. Two disjoint (d-1)-cubes are formed each time a d-cube
is split across one of its dimensions. Since there are exactly d
ways to perform this split, there are only 2d possible (d-1)-subcubes that
may be embedded in a d-cube. Let us label these subcubes $C_{i,j}$, for
$0 \le i < d$ and $0 \le j \le 1$. (Refer to Fig. 1.) $C_{i,j}$ will denote the subcube
comprised of nodes with a j in their i-th bit (and the links between these
nodes). Obviously, subcube $C_{i,0}$ is disjoint from $C_{i,1}$, for $0 \le i < d$.
Any other pair of subcubes $C_{i,j}$ and $C_{k,l}$ ($i \ne k$) will share a (d-2)-cube
comprised of exactly those nodes (and interconnecting links) with a j in
their i-th bit and an l in their k-th bit. If one of the links in this
(d-2)-cube fails, then both $C_{i,j}$ and $C_{k,l}$ will be damaged. In fact, the first link
failure to occur in a cube will damage d-1 subcubes; subsequent failures
may or may not affect the remaining undamaged subcubes. To determine
subcube reliability, we must characterize the effect of the failure of
each link in the system for all relevant system states.

For the moment, consider the failure of the link between nodes 0 and
1. This damages only subcubes $C_{i,0}$, $i > 0$, leaving the remaining
subcubes undamaged. While any link in the system is equally likely to fail,
for the purposes of analysis, we can relabel all the nodes in the cube so
that the faulty link is mapped into the link between nodes 0 and 1.
Since only one dimension is fixed by this procedure (the dimension
which must be labeled 0), there are (d-1)! equally satisfactory ways to
label the remaining dimensions. When additional links fail, an appropriate
labeling for the remaining non-fixed dimensions may be chosen from
this set of (d-1)! labelings so that the subcubes $C_{i,1}$ ($i > 0$) may be
thought of as being damaged in order. That is, $C_{1,1}$ is damaged first,
then $C_{2,1}$, etc., and finally $C_{d-1,1}$. (The cubes $C_{0,0}$ and $C_{0,1}$ may be
damaged at any point in this sequence; we will consider them presently.)
This ordering of damaged subcubes does not constitute an ordering of
the link failures; it is only the selection of a particular labeling for the
nodes as each failure occurs to describe the system state compactly.

We now characterize the state of the system by the set of undamaged
(d-1)-subcubes in the system. The possible system states (for $0 \le i < d$)
are:

$S^* = \{C_{k,j},\ 0 \le k < d,\ 0 \le j \le 1\}$, $S_{2,i} = \{C_{0,0}, C_{0,1}, C_{j,1},\ i < j < d\}$,
$S_{1,i} = \{C_{0,1}, C_{j,1},\ i < j < d\}$, $S_{0,i} = \{C_{j,1},\ i < j < d\}$.


Note that the (d-1)-subcubes in any state above are not all disjoint. It is
possible to characterize these states in terms of disjoint subcubes, in a
manner similar to the state definitions for the node failure model. The
states $S_{2,i}$ each support two disjoint (d-1)-subcubes and these are the
only states for which there are no equivalent states under the node failure
model. State $S_{1,i}$ embeds $d-i$ fault-free disjoint subcubes of order $d-1$,
$d-2, \ldots, i$. (These are $C_{0,1}$, $C_{d-1,1} \cap C_{0,0}$, $C_{d-2,1} \cap C_{d-1,0} \cap C_{0,0}, \ldots$,
respectively.) State $S_{0,i-1}$, $0 < i < d$, is an equivalent state in terms of
disjoint functional subcubes, supporting $d-i$ subcubes of order $d-1$,
$d-2, \ldots, i$. (These are $C_{d-1,1}$, $C_{d-2,1} \cap C_{d-1,0}, \ldots$, respectively.) Thus it
is possible to summarize the correspondence between the node and link
failure models (Figs. 2 and 4) as follows: states $S_{2,i}$ in the link failure
model do not have equivalent states in the node model. State $S_{1,0}$
corresponds to state $S_0$ in the node model; and states $S_{1,i}$ and $S_{0,i-1}$
together correspond to state $S_i$, $0 < i < d$, in the node model. Finally,
state $S_{0,d-1}$ in the link model corresponds to $S_d$ in the node model, the
state in which no functional (d-1)-subcube embedding is possible.

As before, we assume an identical, exponential failure distribution
(rate $\lambda$) for each component (link) in the system. In Fig. 4 we see the
state diagram with the transitions between the 3d+1 states. The
transition rate from $S^*$ to $S_{2,0}$ is clearly $\lambda d 2^{d-1}$, as the failure of any link in a
fault free system will result in a system state of $S_{2,0}$. A detailed
description of the transition rates between the remaining states now follows.

Let us first consider the transitions from state $S_{2,0}$ to state $S_{2,j}$; $C_{0,0}$
and $C_{0,1}$ are both undamaged in these transitions. Any link failure
contributing to this transition must span dimension 0. Without loss of
generality, let the new failed link be incident to node x, $x \in C_{0,0}$. Consider
such nodes with exactly j 1-bits in the node address. We may remap all
the nodes in the cube so that the j 1-bits in the address of node x will be
in bit positions $1, 2, \ldots, j$. Note that x and its new faulty link (across
dimension 0) are now in $C_{k,1}$ for all k, $0 < k \le j$. Therefore, exactly j
additional subcubes have been damaged. We may count the number of
nodes x by counting the nodes in $C_{0,0}$ with exactly j 1-bits; this results
in a transition rate from state $S_{2,0}$ to state $S_{2,j}$ of $\lambda \binom{d-1}{j}$.


To generalize this for the $S_{2,i}$ to $S_{2,i+j}$ transition ($0 \le i < d$,
$0 < j < d-i$), we are again concerned only with faulty links that span
dimension 0. Since the subcubes $C_{k,1}$, $0 < k \le i$, are already damaged,
only the $d-1-i$ subcubes $C_{k,1}$, $i < k < d$, need be considered. Let us
consider a faulty link incident to node x, $x \in C_{0,0}$, with a node address
containing exactly j 1-bits in the bit positions greater than i. Since the
only dimension labels fixed from previous mappings are those $\le i$, we
may remap all the nodes in the cube so that the j 1-bits just described
are in bit positions $i+1, i+2, \ldots, i+j$. Clearly x and its new faulty link are
in $C_{k,1}$ for all k, $i < k \le i+j$. Thus, we count the number of nodes in
$C_{0,0}$ with an address containing exactly j 1-bits in bit positions k,
$i < k < d$, and obtain a transition rate of $\lambda 2^i \binom{d-1-i}{j}$. This rate (among
others) is shown in Fig. 5a.

The remainder of the transitions from state $S_{2,i}$ involve the links
which do not span dimension 0; failure of any of the non-zero
dimensioned links will damage either $C_{0,0}$ or $C_{0,1}$ plus j additional subcubes,
$0 \le j < d-i$. Since we may remap the cube so that $C_{0,0}$ is always
damaged first, we may consider links in $C_{0,0}$ without loss of generality. To
account for link failures in $C_{0,1}$, the final rate will be doubled.

Consider some node x in $C_{0,0}$ with an address containing j 1-bits in
the $d-1-i$ bit positions greater than i. As before, we may remap all the
nodes in the cube so that the j 1-bits just described are in bit positions
$i+1, i+2, \ldots, i+j$. We need to determine all the links incident to x (other
than the link spanning dimension 0) such that x and this link will be in
$C_{k,1}$ for all k, $i < k \le i+j$. The only links which fit this description span
the dimensions $1, 2, \ldots, i, i+j+1, i+j+2, \ldots, d-1$. We may determine the
number of nodes x by counting the number of nodes in $C_{0,0}$ with an
address containing exactly j 1-bits in bit positions k, $i < k < d$. However,
we may not simply multiply this figure by $d-1-j$ to obtain the
number of links, since the links which span dimensions $1, 2, \ldots, i$ are
incident to two nodes with exactly j 1-bits in bit positions k, $i < k < d$.
Thus, half of these links must be subtracted out so that the total number
of links is $2^i \binom{d-1-i}{j} (d-1-i/2-j)$. To take into account links in $C_{0,1}$, we
double this figure to obtain the rate for the $S_{2,i}$ to $S_{1,i+j}$ transition as
$\lambda 2^{i+1} \binom{d-1-i}{j} (d-1-i/2-j)$ for $0 \le i < d$, $0 \le j < d-i$.

The transitions from state $S_{1,i}$, $0 \le i < d$ (Fig. 5b), are similar to the
transitions out of state $S_{2,i}$. For the $S_{1,i}$ to $S_{1,i+j}$ ($0 < j < d-i$)
transition, start with the rate $\lambda 2^i \binom{d-1-i}{j}$ for links spanning dimension 0.
Since $C_{0,0}$ is already damaged, we must add some of the links within
$C_{0,0}$. (In the $S_{2,i}$ case, failure of these links caused a transition from
state $S_{2,i}$ to state $S_{1,i+j}$.) Thus we add $\lambda 2^i \binom{d-1-i}{j} (d-1-i/2-j)$ for a total
rate of $\lambda 2^i \binom{d-1-i}{j} (d-i/2-j)$. The transition from state $S_{1,i}$ to state $S_{0,i+j}$
($0 \le j < d-i$) is identical to the $S_{2,i}$ to $S_{1,i+j}$ case except that since $C_{0,0}$
has already been damaged, we do not need to double the rate. Thus the
transition rate is simply $\lambda 2^i \binom{d-1-i}{j} (d-1-i/2-j)$.

Finally, the transitions from state $S_{0,i}$, $0 \le i < d$ (Fig. 5c), may be
obtained by adding the $S_{1,i}$ to $S_{1,i+j}$ transition rate to the $S_{1,i}$ to $S_{0,i+j}$
transition rate for $0 < j < d-i$. Thus we obtain a rate of
$\lambda 2^i \binom{d-1-i}{j} (2d-2j-i-1)$.
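
The link counts behind these rates can be tabulated and sanity-checked with a short sketch. The function `link_counts` and its dictionary keys are illustrative names, not notation from the paper; each count is to be multiplied by the link failure rate $\lambda$ to give a transition rate. Entries with j = 0 in the "S2->S2", "S1->S1" and "S0->S0" families are link failures that leave the state unchanged.

```python
from math import comb


def link_counts(d: int, i: int, j: int) -> dict[str, int]:
    """Number of links whose failure damages j further subcubes C_{k,1}, k > i.

    "S2->S1" stands for the S_{2,i} -> S_{1,i+j} transition, and so on.
    """
    # Nodes of C_{0,0} with exactly j 1-bits in the bit positions above i.
    nodes = 2**i * comb(d - 1 - i, j)
    # Links inside C_{0,0} at those nodes: nodes * (d - 1 - i/2 - j).
    inside = nodes * (2 * (d - 1 - j) - i) // 2
    return {
        "S2->S2": nodes,               # links spanning dimension 0
        "S2->S1": 2 * inside,          # links inside C_{0,0} or C_{0,1}
        "S1->S1": nodes + inside,      # dimension 0, plus links inside damaged C_{0,0}
        "S1->S0": inside,              # links inside C_{0,1}
        "S0->S0": nodes + 2 * inside,  # sum of the two S_{1,i} counts
    }
```

As a consistency check, summing the counts leaving any one state over $0 \le j < d-i$ accounts for every one of the $d 2^{d-1}$ links of the cube exactly once.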

Now that we have completely specified the rates of all transitions in
the state diagram, we may write the state equations. Let $P_{i,j}(t)$ be the
probability of being in state $S_{i,j}$ at time t. Solutions for $P_{i,j}(t)$ and $P^*(t)$
may be expressed as sums of exponentials, just as the solutions for the
node model were. However, we have been unable to obtain closed form
solutions for the coefficients in this case and have evaluated these
numerically. Based on the results, one can define the following probabilities:


$P^*(t)$ = Probability that no failures have occurred.
$P_{2\text{-}(d-1)\text{cubes}}(t)$ = Probability that link failures have occurred, but two
disjoint (d-1)-subcubes can be embedded $= \sum_{i=0}^{d-1} P_{2,i}(t)$.
$P_i(t)$ = Probability that link failures have occurred leaving
$d-i$ functional disjoint subcubes of order $d-1, \ldots, i$;
$P_i(t) = P_{1,0}(t)$ for $i = 0$, and $P_i(t) = P_{1,i}(t) + P_{0,i-1}(t)$ for $i > 0$.
Reliability measures are then given by:

$R^*(t) = P^*(t)$, $R_{2\text{-}(d-1)\text{cubes}}(t) = R^*(t) + P_{2\text{-}(d-1)\text{cubes}}(t)$,
$R_0(t) = R_{2\text{-}(d-1)\text{cubes}}(t) + P_0(t)$, $R_i(t) = R_{i-1}(t) + P_i(t)$, $0 < i < d$.

The mean time to failure figures, $T$, corresponding to these reliability
measures are shown in Table 2 for a 10-cube with $\lambda = 10^{-6}$ per hour.
Note that we have used a lower failure rate for the links than that for
the nodes, to account for their lower logical complexity. It is interesting
to see from Tables 1 and 2 that the increase in MTTF over the node
model is not commensurate with the decreased failure rate. For d = 6,
the increase in the $T_i$'s is by a factor of 4 to 5, whereas for d = 12, the
increase is about a factor of 2. The effects of link and node failures on
the system are indeed different. A single node failure damages d possible
(d-1)-subcubes, while a link failure destroys only d-1 of these. As
few as two node failures can destroy all (d-1)-subcube embeddings in a
d-cube, whereas this requires three failures under the link model. The
more catastrophic effect of node failures on the system is offset by the
fact that there are d/2 times as many links as there are nodes. This
second effect begins to dominate as the size of the network increases
(beyond d = 3). In particular, if identical failure rates were assumed for
nodes and links, the mean time to failure under the link model would be
less than that under the node model for d > 3.

Other observations made in the last section regarding Table 1 also
apply here. One way to relax the size of the functional subcubes is to
insist on $2^{d-i}$ disjoint i-subcubes, $(d-i) \ge 1$, for the system to be
functional. The mean time to failure for this condition can be derived in a
straightforward manner. We define $d-i+2$ states $S_j$, $0 \le j \le d-i+1$,
with all the faulty links in state $S_j$ spanning exactly j dimensions. State
$S_{d-i+1}$ is the breakdown state. The number of link failures that can cause
a transition from state $S_j$ to $S_{j+1}$ is $(d-j) 2^{d-1}$. The state equations for
this new system can be solved and it can be shown that the mean time to
failure is given by

$$T = \frac{1}{\lambda 2^{d-1}} \sum_{j=0}^{d-i} \frac{1}{d-j}.$$

This is evaluated in Table 3 for d = 10, 11, and 12, and various values of
i. We can see a significant increase in MTTF as the size of the functional
portions is reduced. Thus in a 12-cube, the MTTF for two functional
11-cubes is 85 hours, whereas it is over 300 hours for 32 functional
7-cubes.
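
Because each state $S_j$ can only move to $S_{j+1}$, the mean time to failure is simply the sum of the mean holding times of states $S_0$ through $S_{d-i}$. The sketch below evaluates that sum; the function name is hypothetical, and the link failure rate of $10^{-6}$ per hour used in the usage note is an assumption, chosen because it reproduces the 85-hour figure quoted above.

```python
def mttf_disjoint_subcubes(d: int, i: int, lam: float) -> float:
    """MTTF when the system needs 2**(d-i) disjoint fault-free i-subcubes.

    State S_j has faulty links spanning exactly j dimensions; the S_j -> S_{j+1}
    transition occurs at rate lam * (d - j) * 2**(d-1), and S_{d-i+1} is the
    breakdown state.
    """
    return sum(1.0 / (lam * (d - j) * 2 ** (d - 1)) for j in range(d - i + 1))
```

With `lam = 1e-6`, `mttf_disjoint_subcubes(12, 11, lam)` evaluates to about 85 hours and `mttf_disjoint_subcubes(12, 7, lam)` to a little over 300 hours.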


4. COMBINED NODE AND LINK FAILURE MODEL 

While we have explored the probability of embedding functional sub- 
cubes in a damaged d-cube under a node failure model and a link failure 
model, we have not considered what happens when both node and link 
failures may occur. Under the classic node failure model we considered 
earlier, link failures may be disregarded since a link failure may be 
modeled as the failure of one of its terminal nodes. Thus the node 
failure rate encompassed the failure of the node and all its incident links. 
However, to apply this technique to the link failure model would necessi- 
tate modeling a node failure as the simultaneous failure of all the node’s 
incident fault free links. This would violate the assumption that failures 
are independently distributed. For this reason, a combined node and link 
failure fault model is developed here. 

The node failure rate will be denoted by $\lambda_n$ and the link failure rate by
$\lambda_l$. Both rates are constant and independent. For our new model, let us
begin with the link failure model developed in the last section. The
question now arises as to what happens to the links incident to a failed
node. A link failure is significant to the analysis of the previous section
if and only if it belongs to some (as yet) undamaged (d-1)-subcube.
Links incident to failed nodes are not a part of any undamaged subcube
and thus may be ignored in the combined model. Thus the state diagram
and link failure transitions (with $\lambda = \lambda_l$) depicted in Figs. 4 and 5
accurately describe the transitions due to link failures (alone) for the new
model. All that remains is to add the node failure transition rates.
These rates are developed in the following paragraphs.

When a node failure occurs in a previously fault free system, the fault
may be mapped to node 0. Clearly this damages d subcubes ($C_{i,0}$ for
$0 \le i < d$) and corresponds to the system state $S_{1,0}$. Thus we have a
new transition from state $S^*$ to state $S_{1,0}$ with rate $\lambda_n 2^d$ (see Fig. 6a).
Any node failure in state $S_{2,i}$, $0 \le i < d$, puts the system into state $S_{1,i+j}$,
$0 \le j < d-i$, since it damages either $C_{0,0}$ or $C_{0,1}$ and j subcubes $C_{k,1}$,
$i < k < d$. The rates of these transitions may be derived in a manner
similar to the way link failures were derived, i.e., counting the number of
nodes with addresses containing j 1-bits in the $d-1-i$ bit positions
greater than i. The rate must be doubled to account for failures in both
$C_{0,0}$ and $C_{0,1}$, leading to the transitions depicted in Fig. 6a.

In state $S_{1,i}$, $0 \le i < d$, a node failure will damage all, some or none
of the remaining subcubes. First considering the nodes in $C_{0,0}$, we note
that only those nodes with addresses containing 1-bits in bit positions
greater than i will damage additional subcubes. These are the transitions
from state $S_{1,i}$ to state $S_{1,i+j}$, $0 < j < d-i$, shown in Fig. 6b. Next, if the
faulty node is in $C_{0,1}$, at least one subcube ($C_{0,1}$) will be damaged; j
additional subcubes, $0 \le j < d-i$, will be damaged for nodes with
addresses containing j 1-bits in bit positions greater than i. These are
the transitions from state $S_{1,i}$ to state $S_{0,i+j}$ depicted in Fig. 6b. Note
that only these $2^d - 2^i$ node failures are relevant for the transitions out of
state $S_{1,i}$.

Finally, the transitions from state $S_{0,i}$, $0 \le i < d$ (Fig. 6c), correspond
to the failure of nodes with addresses containing j 1-bits, $0 < j < d-i$, in
bit positions greater than i.

All the node transition rates described above can now be combined
with the link failure rates developed in the previous section to derive the
system state equations. When $\lambda_n = 0$, this system is obviously identical
to that developed in the preceding section for the link failure model.
When $\lambda_l = 0$, the states $S_{2,i}$ are no longer a part of the model. This
removes the difference between the link and node failure models referred
to in the previous section so that states $S_{1,i}$ and $S_{0,i-1}$ combined can be
renamed $S_i$, to derive the node model. (State $S_{1,0}$ would correspond to
$S_0$, and state $S_{0,d-1}$ to $S_d$.) We can solve the state equations for the
combined model numerically to obtain the state probability distributions.
Reliability measures defined in the last section can be computed under
this model, and corresponding MTTF's are shown in Table 4.

Clearly the reliability and MTTF of the network are lower when both 
link and node failures are taken into account, than when only one of 
them is. Node failures have the dominant effect, but only due to their 
higher failure rate as discussed at the end of Section 3. However the 
results of this section show that neither component can be neglected 
(unless one has a much lower failure rate than the other) and the com- 
bined model gives a more accurate picture of system degradation. 

We close this section by presenting a model that would seem initially
to be a gross approximation to the combined model, but whose accuracy
relative to its simplicity turns out to be excellent. Let us go back to the
node failure model of Section 2 and define a supernode to be a node plus
half its incident links. (Each link thus "belongs" to one node.) The
failure of any of the (1 + d/2) components of the supernode would lead to
its failure. (This is where the approximation lies.) The failure rate for
each supernode is then given by $\lambda = \lambda_n + d\lambda_l/2$. When d is odd, we will
use this expression as an estimate for the supernode failure rate, even
though we cannot associate the same number of links with each
supernode. Substituting this value of $\lambda$ into the node failure model of Section
2, we derive the mean time to failure figures in Table 5. The
approximation is clearly conservative in that not every link failure would in reality
affect the node (say, A) to which it is assigned; if the other end of the
link is connected to a node which has already failed, then node A need
not be brought down when the link fails. However, it provides results
very close to those from the exact model.
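
Since every mean time in the node failure model is proportional to the reciprocal of the failure rate, the supernode figures can be obtained by rescaling node-model results. The sketch below illustrates this under the supernode assumption; the function names and the sample rates in the usage note are hypothetical and are not taken from Tables 4 or 5.

```python
def supernode_rate(d: int, lam_node: float, lam_link: float) -> float:
    """Failure rate of a supernode: one node plus d/2 of its incident links."""
    return lam_node + d * lam_link / 2


def rescale_mttf(mttf_node_model: float, d: int, lam_node: float, lam_link: float) -> float:
    """Rescale a node-model MTTF computed at rate lam_node to the supernode rate.

    Every node-model MTTF is proportional to 1/lambda, so only the ratio of
    the two rates is needed.
    """
    return mttf_node_model * lam_node / supernode_rate(d, lam_node, lam_link)
```

For instance, with d = 10, a node rate of $10^{-5}$ and a link rate of $10^{-6}$ per hour, the supernode rate is $1.5 \times 10^{-5}$ per hour, so a node-model figure of 195 hours shrinks to 130 hours.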


5. REMAPPING THE CUBE NODES 

All the mappings described in previous sections were intended solely 
for the purposes of analysis; in this section we describe a procedure for 
selecting an appropriate mapping for the nodes in the cube so that 
undamaged subcubes may be used by application tasks. The first step is 
to determine which of the possible embedded (d—1)-subcubes are undam- 
aged. Next, the number and orders of disjoint subcubes must be 
identified. The final chore of remapping is then reduced to fixing some k
bits in the node address, resulting in a (d-k)-subcube. We first describe
the procedure for the case where we have subcubes of order $d-1, d-2, \ldots, i$;
subsequently, the procedure for $2^{d-i}$ functional subcubes is explained.

We label the possible (d-1)-subcube embeddings $C_{i,j}$, $0 \le i < d$,
$0 \le j \le 1$, as described earlier. When some node x fails, it damages
exactly d of these embeddings. The identity of the damaged cubes is
determined using the following algorithm:

for all bits $x_i$ in the node address x, set $C_{i,x_i}$ = DAMAGED;

As each link fails, it damages exactly d-1 embeddings. For embedding
$C_{i,j}$ to be damaged, both nodes incident to the faulty link must lie in
$C_{i,j}$. If we describe the damaged link as incident to node x across
dimension j, we may use the following algorithm to determine what
embeddings are no longer possible:

for all bits $x_i$ in the node address x, if ($i \ne j$) set $C_{i,x_i}$ = DAMAGED;
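
The two marking rules translate almost directly into code. In the sketch below, an embedding $C_{i,b}$ is represented by the pair `(i, b)`, and a failed link by the pair `(x, j)` of an endpoint address and the dimension it spans; these representations and the function names are assumptions made for illustration.

```python
def damaged_by_node(x: int, d: int) -> set[tuple[int, int]]:
    """Embeddings C_{i,b}, as pairs (i, b), damaged when node x fails."""
    return {(i, (x >> i) & 1) for i in range(d)}


def damaged_by_link(x: int, j: int, d: int) -> set[tuple[int, int]]:
    """Embeddings damaged when the link at node x across dimension j fails."""
    return {(i, (x >> i) & 1) for i in range(d) if i != j}


def undamaged_embeddings(d, failed_nodes, failed_links):
    """Sorted list of (d-1)-subcube embeddings that remain fault free."""
    damaged = set()
    for x in failed_nodes:
        damaged |= damaged_by_node(x, d)
    for x, j in failed_links:
        damaged |= damaged_by_link(x, j, d)
    return sorted({(i, b) for i in range(d) for b in (0, 1)} - damaged)
```

In a 3-cube, the failure of nodes 0 and 7 leaves no undamaged embedding, while the failure of the single link between nodes 0 and 1 leaves $C_{0,0}$, $C_{0,1}$, $C_{1,1}$ and $C_{2,1}$.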

After the list of current failures has been evaluated, the system may
support two disjoint (d-1)-subcubes if and only if there exists an integer
k such that both $C_{k,0}$ and $C_{k,1}$ are undamaged. (This corresponds to a
system state of $S_{2,i}$.) Otherwise, there are k undamaged, non-disjoint
(d-1)-subcubes. Let these subcubes be labeled $C_{i_1,j_1}, C_{i_2,j_2}, \ldots, C_{i_k,j_k}$.
From these, k disjoint subcubes of order $d-1, d-2, \ldots, d-k$ may be
obtained as follows: the (d-1)-subcube is comprised of all nodes x such
that $x_{i_1} = j_1$; the (d-2)-subcube is comprised of all nodes x such that
$x_{i_2} = j_2$ and $x_{i_1} \ne j_1$; etc. In general, the (d-l)-subcube is comprised of
all nodes x such that $x_{i_l} = j_l$ and $x_{i_m} \ne j_m$ for all $m < l$. This completely
identifies the disjoint, functional subcubes.
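
A minimal sketch of this extraction step follows. It assumes the undamaged embeddings are supplied as `(i, j)` pairs with pairwise distinct dimensions (the case in which no two disjoint (d-1)-subcubes exist); the function names are illustrative and not part of the paper.

```python
def disjoint_subcubes(undamaged: list[tuple[int, int]]) -> list[dict[int, int]]:
    """Fixed address bits of k disjoint subcubes of order d-1, ..., d-k.

    The l-th subcube keeps x_{i_l} = j_l and avoids the earlier subcubes by
    forcing x_{i_m} != j_m for every m < l.
    """
    result = []
    for l, (i_l, j_l) in enumerate(undamaged):
        fixed = {i_m: 1 - j_m for i_m, j_m in undamaged[:l]}
        fixed[i_l] = j_l
        result.append(fixed)
    return result


def members(d: int, fixed: dict[int, int]) -> set[int]:
    """Nodes of a d-cube whose address matches every fixed bit."""
    return {x for x in range(2**d) if all(((x >> i) & 1) == b for i, b in fixed.items())}
```

For a 4-cube with undamaged embeddings $C_{3,1}$, $C_{2,1}$ and $C_{1,1}$, this yields subcubes of 8, 4 and 2 nodes that share no node with one another.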

The procedure for $2^{d-i}$ embedded subcubes of order i is somewhat
simpler. Each node need only keep track of dimensions that have no
faulty links. That is, when a link failure occurs across some dimension
j, then that dimension is no longer considered fault free. As long as the
number of fault free dimensions is at least i, the embedding is possible.
The subcubes may be identified by fixing all but the i lowest fault free
dimensions, yielding the $2^{d-i}$ subcubes.
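
The bookkeeping for this simpler scheme can be sketched as follows. The function names are hypothetical, and the failed links are assumed to be summarized by the list of dimensions they span.

```python
from typing import Optional


def fault_free_dimensions(d: int, failed_link_dims: list[int]) -> list[int]:
    """Dimensions of the d-cube that no failed link spans, lowest first."""
    faulty = set(failed_link_dims)
    return [k for k in range(d) if k not in faulty]


def partition_into_i_subcubes(d: int, i: int, failed_link_dims: list[int]) -> Optional[list[list[int]]]:
    """Split the d-cube into 2**(d-i) i-subcubes that use only fault free dimensions.

    Returns None when fewer than i dimensions are fault free.
    """
    free = fault_free_dimensions(d, failed_link_dims)
    if len(free) < i:
        return None
    span_mask = sum(1 << k for k in free[:i])   # the i lowest fault free dimensions
    groups: dict[int, list[int]] = {}
    for x in range(2**d):
        # Nodes that agree on every bit outside the spanned dimensions
        # belong to the same i-subcube.
        groups.setdefault(x & ~span_mask, []).append(x)
    return list(groups.values())
```

For a 4-cube with a failed link across dimension 1, asking for 2-subcubes fixes bits 1 and 3 and returns four groups of four nodes each.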


6. SUMMARY 

We have attempted to characterize in this paper the reliability and
degradation of the hypercube structure as network components begin to
fail. The analysis was based upon the damaged cube's ability to
embed functional subcubes of different sizes. Three different models
were presented: one that considered node failures alone, another that con- 
sidered only link failures, and a third that permitted both node and link 
failures. The latter two models have comparable complexity, so that the 
real choice is between the relatively simple node model and the more 
realistic, but complex combined node and link model. We have also sug- 
gested a technique to incorporate link failures into the nodes, that yields 
a simple but effective approximation to the combined model. While the 
effect of a single link failure is less catastrophic than that of a node 
failure, the larger number of links results in their having the dominant 
effect on system reliability, if nodes and links have comparable failure 
rates. These studies show that even though the d-cube structure is des- 
troyed by the very first component failure, the cube is quite resilient in 
terms of its ability to support several smaller subcubes in the damaged 
structure. 


SOLVING VISIBILITY PROBLEMS ON MCC'S


Mi Lu 


Department of Electrical Engineering 

Abstract: In this paper, we present MCC algorithms
to solve the visibility problem for a set of disjoint simple
objects such as line segments, circles, and simple polygons in
the plane. For a collection of n such objects, our algorithms
show how to compute, on a $\sqrt{n} \times \sqrt{n}$ MCC, a view of these
objects in $O(\sqrt{n})$ time. Both parallel and perspective views
are considered. The previous algorithms for computing the
views are sequential and have $O(n \log n)$ time complexity
[1].

For the above tasks, we also describe methods to solve 
problems of size n on MCC’s with p processors, where 
p <n. Analysis will be given on the time complexity and 
the limitations imposed by the computational and commu- 
nication requirements. 


I. Introduction 


The Mesh-Connected Computer (MCC) operates as a single
instruction stream, multiple data stream (SIMD) computer in
which each processing element (PE) can directly communicate with at most four
neighbors. A $\sqrt{n} \times \sqrt{n}$ MCC consists of n identical PE's
arranged on a two-dimensional grid with processors at the
grid points and connections between every pair of horizontally
or vertically adjacent PE's. Each PE has a constant number of
storage registers, and each can perform standard arithmetic
or boolean operations in O(1) time. MCC's have been widely
used in different areas, and MCC algorithms have been
designed to solve various problems [2-6].


An important and fundamental algorithmic problem in
computer graphics is the following: given a set of objects
in three-dimensional space, compute the view from some
fixed direction or point. The main issue is to eliminate all
parts of the objects that cannot be seen (i.e., that lie
behind some other object). It is a generalization of the
hidden line problem in which objects have straight line edges.
The problem also has numerous applications in motion
planning in robotics and in VLSI layout, which have attracted
considerable attention in recent years. In our
consideration, we simplify the preceding problem conceptually to
two-dimensional space, not only because it is often used
as a subproblem in other geometric problems (the
shortest-path problem, for example), but also because the
solution for the two-dimensional problem is the main part of the
three-dimensional solution, and will show directions for
further research on the corresponding problems in three- and
higher-dimensional space.


In the rest of the paper, we present in Chapter II
the MCC algorithms for solving visibility problems, and in
Chapter III the problem solution on MCC's of smaller size.
Chapter IV gives the concluding remarks.
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IT. Solving Visibility Problem on MCC’S 


The approach we use to solve the visibility problem is
divide-and-conquer. Applied to the mesh, it divides the
mesh into two submeshes of equal size, called the left and
right (or upper and lower, respectively) submeshes. We
recursively solve the two subproblems on the two submeshes
in parallel, and then combine the two subsolutions to obtain
the final result. Efficient data movement must be designed
for the merge step to exploit the inherent concurrency.


2.1 The visibility problem 


A view is a picture one sees looking from a direction 
or a point. A view from a point is a perspective view. In 
this case the view consists of a circle on which the parts of 
the objects one can see from the given point are projected. 
A view from a direction is a parallel view. In this case, the 
view consists of a line on which the parts of the objects that 
are visible from the direction are projected. Perspective 
views correspond to our natural way of viewing from a place 
close to the object set, and parallel views correspond to the 
viewing from a place far from the object set. 


A simple object is a bounded convex object with the 
following properties: 
1. Any parallel or perspective view of it can be computed 
in constant time. 
2. If the views of two simple objects overlap, then constant 
time suffices to decide which one of the two objects can be 
seen entirely. 
3. The up to four common tangents of two simple objects
can be computed in constant time.

Two objects that touch each other (i.e., objects whose
boundaries overlap but do not cross) are by definition
nonintersecting. Typical examples of simple objects are line
segments, disks, and convex polygons with a bounded number
of edges. A set of simple objects is shown in Figure 1.
Figure 2 is their perspective view, and Figure 3 is their
parallel view.


The visibility problem can be formally stated as follows: 
Given a set of n disjoint simple objects, and a point or a 
direction, report in order the parts of the objects that are 
visible from the given point or the given direction. A point p
is visible from q if the line segment pq intersects no objects
in the set.


Previous work on the visibility problem includes the
following. For the case of a single polygon, El Gindy and
Avis [7] and Lee [8] presented O(n) algorithms. Asano [1]
gave an O(n + h log h) time algorithm for the case where the
h disjoint polygons are convex, and an O(n log h) time
algorithm for the general problem. It has been proved that
for h disjoint polygons with n edges, the optimal time
complexity to find the visibility polygon using O(n) space
is bounded by O(n + h log n) [9]. In [10], Edelsbrunner et
al. used O(n) search time, O(n^2 log n) preprocessing time
and O(n^2 log n) space to solve the visibility polygon
problem. Recently, Asano et al. [9] solved the visibility
polygon problem in O(n^2) preprocessing time, O(n^2) space
and O(n) time. The visibility graph of disjoint polygons
with n edges can be found by solving the visibility problem
from each vertex of those polygons. This problem had
previously been solved in O(n^2 log n) time by Lee [11], and
recently in O(n^2) time by Welzl [12] and Asano et al. [9]
independently. The shortest path between two points in the
plane with polygonal obstacles can be computed by applying
Dijkstra’s algorithm to the visibility graph of the
obstacles. This problem is of current interest because it
is an instance of a general class of important problems in
robotics, known as collision avoidance problems (see, for
example, Lozano-Perez and Wesley [13]).


2.2 Computing the parallel view 


A parallel view of a set of objects consists of a partition 
of a line. Each part of the line corresponds to an object in 
the set (from the direction of view). The lowest part of the 
objects in the interval (k,k + 1) is visible, and we want to 
find the lowest part in all intervals, i.e., the lower envelope 
of the set of objects. (See Figure 4.) 


To each part of the line we assign the index of the cor- 
responding object. If the part corresponds to a place where 
one can look through the set we assign NULL to it. It is 
possible that different parts of the line correspond to the 
same object, and hence, are assigned the same index (see, 
for example, object 2 in Fig. 4). A partition point corre- 
sponds to a leftmost point of an object (Fig. 5(A)) or a 
rightmost point of an object (Fig. 5(B)), with respect to 
the direction of view. (A part of the line might consist of 
one point if it corresponds to a line segment in the direction 
of view. In this case we treat it as a double partition point). 
It can be proved that the parallel view of a set of n objects 
from a fixed direction consists of at most 2n + 1 parts and 
at most 2n partition points [10]. 


For computing the view of a set of objects from a fixed
direction we will use a divide-and-conquer technique.
Divide the set S of n objects into subsets A and B, each
containing an approximately equal number of objects. Let the
partition points of a view of A be {a_0, a_1, ..., a_k},
k ≤ 2n − 1, and the parts of the view of A be
{a_0 a_1, a_1 a_2, ..., a_{k-1} a_k}. Similarly, let the
partition points of a view of B be {b_0, b_1, ..., b_k},
k ≤ 2n − 1, and the parts of the view of B be
{b_0 b_1, b_1 b_2, ..., b_{k-1} b_k}. A part of the view of
A, a_i a_{i+1}, is a part of the view of S = A ∪ B iff
projecting a_i a_{i+1} to the view does not cross any other
objects, that is, no part of the view of B falls (even
partially) in the interval (a_i, a_{i+1}), or a part
b_j b_{j+1} falls in the interval but a_i a_{i+1} is
“closer” to the observer than b_j b_{j+1}. We can compute
the view of S = A ∪ B by checking at each partition point of
both views whether or not this point is also a partition
point of the total view. As a result of the definition of a
simple object, this checking can be done in constant time.
Recursively partition the problem into two equal-sized
subproblems, compute the views of the two subsets
simultaneously, and merge the results. Assuming that the
merge of two subproblems, whose sizes sum to n, needs time
M(n), the total time needed, T(n), is given by the following
recurrence:

T(n) = T(n/2) + M(n)

We show below that M(n) is bounded by O(√n) in the
mesh-connected computer implementation.


Distribute the partition points on an MCC, one point
per PE. Since there are 2n partition points in a view of n
objects, 2√n × √n or √n × 2√n PE’s are sufficient. The
PE containing partition point a_i maintains the part
a_i a_{i+1}. Let a submesh of size 2^r be A, and its
adjacent submesh be B. Submesh A contains the view of the
subset A ⊂ S, and submesh B contains the view of the subset
B ⊂ S. The merging of the views of subsets A and B is
performed recursively on submeshes A and B, which together
are of size 2^{r+1}. In iteration i, 2^{i+1} PE’s are
involved. The two phases involved in the merge are as
follows:
(i) Each partition point a_i in the view of A finds the part
b_j b_{j+1} such that a_i falls in the interval of
b_j b_{j+1}, and vice versa. (See Fig. 6.)
(ii) Decide whether b_j b_{j+1} prevents the part a_i a_{i+1}
from being visible, and vice versa.
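
Phase (i) amounts to locating each partition point of one view among the partition points of the other view. The following is a minimal sequential sketch of that rank arithmetic, not the mesh implementation. It assumes 0-based ranks, distinct coordinates, and inputs already sorted in ascending order; the function name `locate_in_other_view` is hypothetical.

```python
def locate_in_other_view(a_xs, b_xs):
    """For each a in a_xs, return j such that b_xs[j] < a < b_xs[j + 1].

    Assumes both lists are sorted ascending and all values are distinct.
    A result of -1 means that a lies to the left of every point of B.
    """
    tagged = [(x, "A", i) for i, x in enumerate(a_xs)]
    tagged += [(x, "B", j) for j, x in enumerate(b_xs)]
    tagged.sort()  # rank in the merged view (global rank)

    located = [None] * len(a_xs)
    for global_rank, (_, owner, local_rank) in enumerate(tagged):
        if owner == "A":
            # global_rank - local_rank counts the B points to the left of a;
            # subtracting 1 gives the index of the nearest one.
            located[local_rank] = global_rank - local_rank - 1
    return located


print(locate_in_other_view([1, 5, 9], [0, 4, 6, 10]))
```

On the mesh, the two ranks would come from sorting on the merged submesh and on the original submesh; here a single sequential sort stands in for both.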


Phase (i) can be done by finding the difference
of global_rank and local_rank of a_i, which tells the
local_rank of b_j. We complete phase (ii) by a
transformation of the coordinate system. Let the angle of
the observing direction be α. Rotate the coordinate axes by
α to obtain the new axes system. The point with the
coordinates (x', y') under the old axis system will have
coordinates (x, y) under the new axis system such that


x = x' cos α − y' sin α
y = x' sin α − y' cos α


Then the view of a single object is just the horizontal line
segment with the leftmost point and the rightmost point of
the object as its end points. The projection of the object in
subset A, say i, will cross the object in subset B, say j, iff
y_i > y_j, where y_i is the y coordinate of the leftmost point
or rightmost point of object i and y_j is the y coordinate of
the leftmost point or rightmost point of object j. The MCC
algorithm will be given in algorithm Parallel View.
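
To make the notion of a parallel view concrete, here is a brute-force sequential reference, not the MCC algorithm. It assumes the coordinates have already been transformed so that every object is a horizontal segment `(x_left, x_right, y)` and that the object with the smaller y is the one seen, as in the lower-envelope description of Section 2.2. The name `parallel_view` is hypothetical, and the two unbounded NULL parts at either end of the line are omitted.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, float]          # (x_left, x_right, y)
Part = Tuple[float, float, Optional[int]]     # (x_start, x_end, object index or None)


def parallel_view(segments: List[Segment]) -> List[Part]:
    """Return the parts of the view from left to right (quadratic time)."""
    xs = sorted({x for seg in segments for x in (seg[0], seg[1])})
    parts: List[Part] = []
    for left, right in zip(xs, xs[1:]):
        mid = (left + right) / 2.0
        visible = None
        for idx, (x0, x1, y) in enumerate(segments):
            # Among the segments covering this interval, the lowest one is seen.
            if x0 <= mid <= x1 and (visible is None or y < segments[visible][2]):
                visible = idx
        if parts and parts[-1][2] == visible:
            # Same object as the previous interval: extend that part.
            parts[-1] = (parts[-1][0], right, visible)
        else:
            parts.append((left, right, visible))
    return parts


view = parallel_view([(0, 4, 2), (1, 2, 1), (6, 7, 5)])
print(view)
```

In this example object 0 contributes two separate parts and one part is `None` (a place where one can look through the set), which mirrors the remarks about repeated indices and NULL parts.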


The record maintained in each PE includes the following
fields:

VIEW 1: Record
   i            /* index of the PE, can also be used as index of
                   the object */
   x, y         /* coordinates of the leftmost or the rightmost
                   point of the object */
   see          /* the index of the part in the view, NULL if no
                   object is visible */
   local_rank   /* in the r-th iteration, the rank of the
                   partition point in the view of 2^r objects */
   global_rank  /* in the r-th iteration, the rank of the
                   partition point in the view of 2^{r+1}
                   objects */
   base         /* in the r-th iteration, the (log n + 1 − r)
                   MSB’s of the PE index */
   bias         /* temporary variable to record the local_rank of
                   a partition point in the view to be merged */
end /* of record */


Algorithm Parallel View

1. Distribute the n objects on the MCC, so that each two
   consecutive PE’s contain the same object, say i.

2. PE(2k) finds the leftmost point of the object it contains
   and PE(2k + 1) finds the rightmost point of the object it
   contains, for k = 0, 1, ..., n − 1.
   /* initialize the partition points */
   Record the coordinates of the points found as (x, y).
   PE(2k) sets see = i. PE(2k + 1) sets see = NULL.

for r := 1 to log n do the following:

3. base = (log n + 1 − r) MSB’s of i.
   /* represented by b_p ... b_1 b_0 */

4. Sort the partition points in non-decreasing order by x,
   on the 2^{r+1} submesh. Find the global_rank of each point.

5. Sort the partition points in non-decreasing order by y,
   on the 2^r submesh. Find the local_rank of each point.

6. Each PE computes
      bias = global_rank − local_rank − 1
   and concatenates it to base with the LSB complemented
   (denoted as base(b̄_0)) to form addr.

7. Each PE performs a RAR from PE(addr) to get y_addr
   and see_addr.

8. if ((y > y_addr) ∧ (see_addr ≠ NULL)) ∨ ((y < y_addr)
   ∧ (see = NULL))
      see = see_addr.

end /* of algorithm Parallel View */


Step 1 takes O(√n) time. Step 2 needs only constant
time since the objects are simple. Step 3 also needs
constant time. Sorting in step 4 and step 5 requires
O(√(2^{r+1})) time and O(√(2^r)) time, respectively. The
time needed by steps 6 and 8 is constant. The RAR performed
in step 7 uses O(√(2^r)) time. Thus, the time required in
iteration r is bounded by O(√(2^r)), and the total time
needed in all the iterations is:

T(n) = √(2^1) + √(2^2) + ... + √(2^{log n}) = O(√n).
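
The sum above is geometric with ratio √2, so it can be checked numerically. The sketch below assumes n is a power of two and drops the constant factors hidden in the O-notation; the function names are hypothetical. The closed form (2 + √2)(√n − 1) follows from the standard geometric-series formula.

```python
import math


def total_merge_time(n: int) -> float:
    """Sum of sqrt(2^r) for r = 1 .. log2(n), with n a power of two."""
    levels = int(math.log2(n))
    return sum(math.sqrt(2 ** r) for r in range(1, levels + 1))


def closed_form(n: int) -> float:
    """Geometric-series value of the same sum: (2 + sqrt(2)) * (sqrt(n) - 1)."""
    return (2 + math.sqrt(2)) * (math.sqrt(n) - 1)


for n in (4, 64, 1024, 2 ** 20):
    print(n, total_merge_time(n), closed_form(n))
```

Because the sum never exceeds (2 + √2)√n, the last iteration dominates and the total stays within a constant factor of √n.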


We have considered the view from a given direction, 
i.e., a parallel view. However, there is another interest- 
ing type of view, called a perspective view, which consists 
of the portions of the objects that are visible from a given 
point. The problem of finding the perspective view from an 
arbitrary point is discussed in the following section. 


2.3 Computing the perspective view 


Let S be a set of n simple objects, and q be an arbitrary
query point. We want to find the parts of the objects in S
that are visible from q, that is, find the perspective view.


A perspective view of a set of objects consists of a par- 
tition of a circle. Each circle segment corresponds to a part 
of an object that one can see from the fixed point or to a 
place where one can look through the set (see Fig. 7 as an 
example). 


It is easy to verify that a perspective view contains at 
most 2n partition points and hence at most 2n + 1 parts. 


Consider a polar coordinate system with the point q as
the origin and the positive y-axis as the reference. Denote
the polar angle of a point p_i by θ(p_i), where the polar
angle increases counterclockwise around q. The polar
coordinates of a point can be represented as (ρ, θ). The
leftmost point p_l or the rightmost point p_r of an object
is the tangency point such that the line emanating from q is
tangent to the object at it, and θ(p_l) > θ(p_r).


Computing the perspective view is similar to computing
the parallel view. A similar divide-and-conquer technique
is adopted. If p_0, p_1, ..., p_k, k ≤ 2n − 1, denote
the partition points, then the arc p_i p_{i+1} indicates a
part of the view. We will distribute the partition points on
the mesh with each PE containing one partition point and
maintaining the record of the part p_i p_{i+1}.


The problem of finding the perspective view can be
decomposed into the following two subproblems.
(i) Each partition point a_i in the view of A finds the part
b_j b_{j+1} such that θ(b_j) < θ(a_i) < θ(b_{j+1}), and vice
versa.
(ii) Decide the visible part of the objects in the interval
(p_i, p_{i+1}).

Visibility can be checked by determining which object is
nearest to q among those whose interiors are crossed by the
rays emanating from q at polar angles between θ(p_i) and
θ(p_{i+1}). We will show below that we can find, in polar
order, the parts of the given set of n objects that are
visible from q in O(√n) time. In fact, if we cut the plane
along the ray emanating upwards from q and spread it out
according to the angular and radial orders, the spread-out
view is similar to the one we discussed in the last section.


The transformation of the coordinate system needs not
only a rotation but also a translation of the axes. Let
(x', y') be the coordinates before the rotation, (x'', y'')
the ones before the translation, and (h, v) the position of
the observer in the old coordinate system. We have

x'' = x' cos α − y' sin α
y'' = x' sin α − y' cos α


and 


In addition, 


will complete the transformation from the Cartesian system
to the polar system. We consider below Algorithm Perspective
View, an MCC algorithm for computing a perspective view from
a given point.


In Algorithm Perspective View, the record VIEW 2
maintained in each PE is similar to record VIEW 1 given in
section 2.2, except that the field “x, y” is changed to
“ρ, θ”. Algorithm Perspective View is identical to Algorithm
Parallel View except for step 8. Following is a modified
version of step 8.


8'. if ((ρ > ρ_addr) ∧ (see_addr ≠ NULL)) ∨ ((ρ < ρ_addr)
    ∧ (see = NULL))
       see = see_addr.


The same analysis shows that the time needed to find a
perspective view of a set of n simple objects is bounded by
O(√n).

The visibility problem from a point for a set of h (not
necessarily convex) disjoint polygons with n edges in total 
can be solved with the same time complexity by applying 
the above algorithms to the edges of those polygons. We 
first compute the visible portion of the boundary of each 
polygon from the point. The result is a sequence of edges 
from each polygon. Then the sequence can be decomposed 
where the visible parts in the view are found. 


A visibility graph of n arbitrarily oriented segments is a
graph whose vertices are the endpoints of those segments and
whose edges are the straight line segments joining vertices
that are visible from each other. This graph can be
constructed by solving the visibility problem from each
vertex for the given segments. As an application of this
result, the shortest path between two points in the plane
with polygonal obstacles having n edges can be computed by
Dijkstra’s algorithm, provided that the visibility graph is
available.
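
The shortest-path application can be sketched as follows. This is an illustrative sequential example that assumes the visibility graph has already been computed and is supplied as a list of mutually visible vertex pairs; the function name and the sample coordinates are hypothetical. Edge weights are Euclidean distances.

```python
import heapq
import math


def shortest_path_length(points, visible_pairs, source, target):
    """Dijkstra's algorithm on a visibility graph with Euclidean edge weights."""
    graph = {i: [] for i in range(len(points))}
    for u, v in visible_pairs:
        w = math.dist(points[u], points[v])
        graph[u].append((v, w))
        graph[v].append((u, w))

    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target not reachable


# Source (0, 0) and target (4, 0) separated by a vertical obstacle
# segment from (2, -1) to (2, 1); the pairs list who sees whom.
pts = [(0, 0), (2, 1), (2, -1), (4, 0)]
pairs = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
print(shortest_path_length(pts, pairs, 0, 3))
```

The source and target do not see each other here, so the path must bend around one endpoint of the obstacle.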


III. Solving Problems on MCC’s of Smaller Size


The results described in the previous chapter were
obtained using the unbounded model of parallel computation,
i.e., we imposed no limit on the number of processors used
by our algorithms. In several papers [14-17] we discussed
methods to solve some geometric problems using MCC’s of the
same size as the problem, that is, the number of elements to
be processed equals the number of processors in the MCC. In
any practical situation, however, we will be required to
handle varying problem sizes with a fixed number of
processors, and the available MCC is very often smaller than
the problem. In this chapter we introduce algorithms to
solve problems on smaller MCC’s, and analyze their time
complexity and the limitations imposed by the computational
and communication requirements.


3.1 Basic approach 


If a problem has n pieces of data initially distributed
one per PE in a mesh of size n, we now consider what
happens when we try to solve the problem on a mesh of size
p, 1 < p ≤ n, where each PE is initially given n/p pieces of
data. This requires that the processors have sufficient
memory to handle the largest problem size that will be
encountered. A processor with its local storage is referred
to as a node.


The basic approach to solving problems using MCC’s of
smaller size is to combine parallel and sequential
processing on the MCC’s. Previously developed MCC algorithms
are used for inter-node processing, while sequential
algorithms are used for intra-node processing. The
algorithms include two phases:


(i) Each PE operates independently on the n/p elements it
contains. Each finds the partial result of the problem
using the sequential algorithm.


(ii) PE’s distribute their partial results to other
processors, using the parallel merging algorithms discussed
previously.


Of course, the time taken by each PE to broadcast its initial
data is n/p times as much as before.


As described above, n pieces of input data are
distributed on a √p × √p MCC, with n/p pieces of data per
PE. Two sorting orders are considered, consecutive order and
cyclic order. In consecutive order, n/p successive elements
of the sorted sequence are stored in each node, with
successive sets of n/p elements being stored in nodes in
order of increasing node address (see Fig. 8(a)). In cyclic
order, node i stores the elements in the set
{j | i = rank(j) mod p}, i.e., the elements whose ranks are
congruent to i modulo p. We describe below the details of
sorting into consecutive order (see Fig. 8(b)).
Rearrangement of consecutive to cyclic storage order (or
vice versa) can be carried out in time O(n/p + p), by
pipelining the data transfers.
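
The two storage orders can be illustrated with a short sketch. It assumes that the items are already sorted, that p divides n, and that nodes are numbered 0 to p − 1; the function names are hypothetical.

```python
def consecutive_order(sorted_items, p):
    """Node i holds the i-th block of n/p successive elements."""
    per_node = len(sorted_items) // p
    return [sorted_items[i * per_node:(i + 1) * per_node] for i in range(p)]


def cyclic_order(sorted_items, p):
    """Node i holds the elements whose rank is congruent to i modulo p."""
    return [sorted_items[i::p] for i in range(p)]


items = list(range(8))             # ranks 0..7 of an already sorted sequence
print(consecutive_order(items, 4))
print(cyclic_order(items, 4))
```

With n = 8 and p = 4, node 1 holds ranks 2 and 3 in consecutive order and ranks 1 and 5 in cyclic order.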


The sorting is carried out first by intra-node processing
and then by inter-node processing. A local sort is performed
initially. Batcher’s odd-even merge [18,19] is then mapped
onto the MCC. In the local sort, each PE sorts the data it
contains independently, which a sequential algorithm can do
in O((n/p) log (n/p)) time. After that, the n/p elements
stored in a node form a sorted sequence. The merge of the
sorted sequences can be accomplished by inter-node
processing in O(n/√p) time [20].


Similar to the algorithms for performing RAR and RAW
on MCC’s with constant memory [3], our RAR and RAW
algorithms on MCC’s with n/p memory per PE are performed
using the well-defined operations SORT, RANK, CONCENTRATE,
DISTRIBUTE and GENERALIZE. The time complexity for both RAR
and RAW is bounded by O(n/√p + (n/p) log (n/p)) [20].


Since the previously described MCC algorithms for
solving computational geometry problems are based on
sorting, RAR and RAW techniques, the results for sorting,
RAR and RAW on MCC’s of smaller size provide solutions to
computational geometry problems on MCC’s of smaller size.


3.2 Lower bound time and optimal size 


We have presented parallel algorithms for solving
computational geometry problems on the mesh under the
unbounded model and on meshes of smaller size. This raises
new questions: what is the trade-off between the time
complexity and the number of processors? Is it true that the
more processors we have, the less time is required?


Given n elements distributed on p processors with n/p
elements per PE, where p < n, Figure 9 shows the
relationship of T versus √p. The line T = √p indicates the
influence of the diameter of the mesh. Two curves,
T = (n/p) log n and T = n/√p, are also given, which indicate
the time needed for intra-node processing and for inter-node
processing, respectively.


Since moving a datum from, say, the upper left corner of
a √p × √p mesh to the lower right corner needs time no less
than 2√p, T should be greater than √p. Meanwhile, we find
that n/√p > (n/p) log n in general. Thus our working area is
above the line T = √p and the curve T = n/√p. It can be
observed that the minimal T is attained at the point
(n, √n). That is, if n processors are provided, we can
obtain the optimal time complexity, which is O(√n).


The sequential algorithms to solve the previous geometric
problems have an optimal time complexity of O(n log n).
With p processors, O((n/p) log n) time performance is
desired. However, it can be realized only when p < log^2 n,
where n/√p > log n. The number of processors used is very
small in that case and the utilization of each processor is
100%, although the time needed is greater than O(√n).


When p > log^2 n, our working area is bounded by
T = n/√p and the processors cannot be utilized with 100%
efficiency. This is because of the bandwidth limitation. The
machine model we used is a limited-connectivity processor
network. We avoided complicated interconnections in building
the machine at the cost of a loss in time performance.


When more than n processors are given, where n is the
size of the problem, we find, perhaps surprisingly, that T
increases as p increases. This shows that putting more than
n processors in operation is simply a waste: we cannot gain
anything in time performance, because the time complexity
is bounded by the diameter of the mesh.


In summary, 0 < p < log^2 n corresponds to the computation
bound region in Fig. 9, log^2 n < p < n corresponds to
the connection bound region, and p > n corresponds to the
diameter bound region.
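
The three regions can be explored with a small cost model. This sketch is an illustration only: it assumes the time is the largest of the three terms discussed above, √p (diameter), (n/p) log n (computation) and n/√p (communication), with all constant factors set to 1 and logarithms taken base 2. The function names are hypothetical, and the boundary case p = n is placed in the connection bound region.

```python
import math


def model_time(n: int, p: int) -> float:
    """Largest of the three terms; constants are dropped."""
    return max(math.sqrt(p), (n / p) * math.log2(n), n / math.sqrt(p))


def region(n: int, p: int) -> str:
    """Classify p into the three regions described for Fig. 9."""
    if p < math.log2(n) ** 2:
        return "computation bound"
    if p <= n:
        return "connection bound"
    return "diameter bound"


n = 4 ** 8  # 65536 elements, log2(n) = 16, log^2 n = 256
for p in (16, 1024, n, 4 * n):
    print(p, region(n, p), model_time(n, p))
```

Under these assumptions the model is minimized at p = n, where it evaluates to √n, and it grows again once p exceeds n.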


IV. Conclusions 


Parallel MCC algorithms for solving visibility problems
are presented. Given a set of n simple objects in the plane,
our algorithms can find a parallel view or a perspective view
of them on a √n × √n MCC with the optimal O(√n) time
complexity. Methods for solving the above tasks on MCC’s
with p processors, where p < n, are also described. We
analyzed their time complexity and the limitations imposed
by the computational and communication requirements. The
result is significant since it provides an efficient
approach for solving a general problem that often occurs
in practical MCC applications.


Figure 5. Two partition points: (A) at the leftmost point
of an object and (B) at the rightmost point of an object.


Abstract


In a parallel processing system with N processors sharing N
memory modules, storing a matrix for conflict-free access is
an important problem. In this paper we propose a method
for storing an N × N matrix in N memory modules such
that any row, column, forward or backward diagonal of the
matrix can be accessed by the processors without conflicts.
It is shown that this problem is similar to the Magic Square
Puzzle, and algorithms are presented for storing the matrix
when N = 2^n for n ≥ 2 and for odd N.


1 Introduction 


Pipelined machines and array processors depend on an un- 
interrupted flow of data for high performance and hence 
the organization of vector elements in the memory mod- 
ules is of prime importance. Any conflicts in data fetch can 
severely degrade the performance of these machines. For 
example, a matrix application program may generate an 
access request for a row of an N x N matrix (vector) stored 
in N memory banks. If all the N vector elements are in one 
memory module, N separate memory accesses will be re- 
quired to retrieve an N element vector. On the other hand, 
a single access is sufficient if the vector elements are spread 
across N memory modules. In the case of memory con- 
flicts additional cycles are required to resolve the conflict. 
The processor-memory speed will no longer be balanced, 
reducing the processor speed by a factor proportional to 
the number of conflicts in any memory module.


In our model of a parallel processing system, we consider
N = 2^n, n ≥ 2, processors connected to N memory modules
by an interconnection network, as shown in Fig. 1. Given
an N × N matrix, we have to find a storage organization for
each element a_ij of the matrix, such that the elements of
any row, column, forward or backward diagonal of the
matrix are in different memory modules and can be accessed
simultaneously. The problem, thus, is to find a mapping
f such that for every element a_ij, f(i, j) = m. Here m is
the memory module number in which a_ij is stored, and f is
such that rows, columns and diagonals can be retrieved in
a single memory access.


1This research was supported by the NSF (Contract No. MIP- 
8452003) with matching funds from the AT&T Information Systems. 


Fig.1 Parallel Processing System Model 


2 Definitions And Related Research


In the remainder of this paper, N = 2^n denotes the number
of processors and M denotes the number of memory
modules.


Many authors [3,4,5,6,7,8] have addressed the issue of
conflict-free access to various vectors (e.g., rows, columns,
diagonals) of a matrix. The function f which assigns a
memory module to each element of the matrix is called the
skew function. f is a linear skew function if the indices of
the element a_ij form a linear combination of the type
k_1 i + k_2 j, where k_1 and k_2 are integer constants.
Lawrie [5] has shown that for M = N, and even value of M, it
is not possible to find a linear skew function for
conflict-free access to the rows, columns and diagonals of
an N × N matrix. Budnick and Kuck [3] have shown that for an
N × N matrix (N = 2^{2k}, k is a positive integer), a linear
skew function cannot provide conflict-free access to the
rows, columns, diagonals and N^{1/2} × N^{1/2} blocks when
stored in N = M memory modules.


Furthermore, Shapiro [6] has claimed that for rows,
columns and diagonals (forward and backward), if there
does not exist a linear skew function for M = N = 2^n
providing conflict-free access, then there is no valid
skewing scheme of any type whatsoever. Deb [4] has provided
a counterexample (Fig. 2) which shows that it is indeed
possible to store a 4 × 4 matrix (i.e., M = N = 2^n, n = 2)
with conflict-free access to rows, columns and the main
forward and backward diagonals.


In the remainder of the paper we show that an N × N matrix
can be skewed for conflict-free access to rows, columns and
the main forward and backward diagonals for N = 2^n, n ≥ 2,
and for odd N, by mapping the problem to a variation
of the famous Magic Square Puzzle. We assume that M = N.
The following paragraph defines the mapping to the
Magic Square Puzzle.


Fig. 3a shows a 4 x 4 matrix without skewing and with 
the linear address of each element marked. Fig. 3b shows 
the elements in their skewed positions as determined by 
some (unknown) skew function. Let us reduce the linear 
addresses of Fig. 3b to their mod-4 values, add 1 and gen- 
erate the matrix in Fig. 3c. It can be seen that the numbers 
1, 2, 3 and 4 occur in each row, column and main diago- 
nals exactly once. If we superimpose Fig. 3c on Fig. 3a 
to generate the matrix in Fig. 3d, we see that associated 
with each element of the unskewed matrix is a number that 
assigns it to a memory module. Notice that the skewing 
has made no change in the row index of the element. This 
assignment results in the skewed matrix in Fig. 3b. 


We see that for any N × N matrix, if there exists another
N × N matrix A whose elements only take on values
1, 2, ..., N such that each value occurs exactly once in any
row, column or diagonal (or any other desired vector), then
each of these vectors of length N can be accessed without
conflict when stored in N parallel memory modules. The
matrix A is called the N × N assignment matrix. Fig. 3c
is an example of a 2^2 × 2^2 assignment matrix.
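
The defining condition of an assignment matrix is easy to check mechanically. The sketch below tests rows, columns and the two main diagonals. The 4 × 4 matrix `EXAMPLE` is a hypothetical matrix constructed for this illustration (it is not reproduced from Fig. 3c or Fig. 5), and `NAIVE` is a plain column-interleaved layout shown for contrast.

```python
def is_assignment_matrix(a):
    """True if every row, column and both main diagonals hold 1..N exactly once."""
    n = len(a)
    full = set(range(1, n + 1))
    rows_ok = all(len(row) == n and set(row) == full for row in a)
    cols_ok = all({a[i][j] for i in range(n)} == full for j in range(n))
    main_diag_ok = {a[i][i] for i in range(n)} == full
    anti_diag_ok = {a[i][n - 1 - i] for i in range(n)} == full
    return rows_ok and cols_ok and main_diag_ok and anti_diag_ok


# Hypothetical example: entry (i, j) is the memory module of element a_ij.
EXAMPLE = [
    [1, 3, 4, 2],
    [2, 4, 3, 1],
    [3, 1, 2, 4],
    [4, 2, 1, 3],
]

# Column-interleaved storage: rows are conflict free, columns are not.
NAIVE = [[j + 1 for j in range(4)] for _ in range(4)]

print(is_assignment_matrix(EXAMPLE), is_assignment_matrix(NAIVE))
```

In `NAIVE` all elements of a column fall in the same module, so fetching a column would need N separate accesses.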


The following definitions are valid for N = 2^n.


Definition 2.1 Let a vector (1, 2, ..., N) be partitioned into
equal segments of size k = 2^e, for some integer e > 0. Then
the k-segment image permutation of the vector is defined,
for 1 ≤ x ≤ N, as


π_k(x) = 2k ⌈x/k⌉ + 1 − (x + k)


As an example, the following is a 2-segment image
permutation for N = 8.


1 2 3 4 5 6 7 8
2 1 4 3 6 5 8 7
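
The example can be reproduced with the formula 2k⌈x/k⌉ + 1 − (x + k), which reverses the order of the entries inside each segment of size k. This is a minimal sketch; the function name is hypothetical.

```python
def image_permutation(x: int, k: int) -> int:
    """k-segment image permutation of position x (1-based)."""
    ceil_x_over_k = -(-x // k)  # integer ceiling of x / k
    return 2 * k * ceil_x_over_k + 1 - (x + k)


print([image_permutation(x, 2) for x in range(1, 9)])
```

For k = 2 and N = 8 this prints the second row of the example above.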


In the following definition a sequence of integers which
occurs in the construction of the assignment matrix is
defined.


Definition 2.2 Let the bit representation of integer x,
(1 ≤ x ≤ N − 1), be x_n x_{n-1} ... x_2 x_1. Consider the
representation of all integers from 1 to 2^n − 1 arranged in
ascending order such that each integer occurs exactly once.
Let the x_i bit of the representation be given a weight of
2^i. Then the j-th element of the sequence is defined as the
weight of the least significant occurrence of 1 in the
binary representation of the j-th integer in the sorted list.


For example, if n = 3, the sequence of length 7 is (2, 4, 2,
8, 2, 4, 2). Also, sequence(i), 1 ≤ i ≤ N − 1, returns the
i-th element of this sequence.
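
Definition 2.2 can be evaluated directly with bit operations. In this sketch the lowest set bit of j is isolated with `j & -j`; because bit x_1 carries weight 2 rather than 1, the result is doubled. The function name `sequence` follows the text, while its signature is an assumption.

```python
def sequence(n: int):
    """Weights of the least significant 1 bit of 1 .. 2^n - 1 (bit x_i weighs 2^i)."""
    return [(j & -j) * 2 for j in range(1, 2 ** n)]


print(sequence(3))
```

For n = 3 this yields the sequence of length 7 given in the example.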


Fig. 3 Memory Module Assignment for 4×4 Matrix:
(a) Unskewed Matrix with Linear Addresses;
(b) Skewed Matrix with Linear Addresses;
(c) Linear Addresses After Taking Modulus 4 and Adding 1;
(d) Matrix with Memory Module Assignment.


The rows of the assignment matrix A are numbered from 1 
(topmost) through N (bottom) and the columns are num- 
bered from 1 (leftmost) through N (rightmost). 


3 Proposed Solution 


In the Magic Square Puzzle [2], numbers ranging from 1 to
N^2 are entered in an N × N matrix such that the sum of
each row, column or diagonal is a constant. Our problem
requires numbers ranging from 1 to N to be entered such
that each row, column and diagonal adds up to N(N+1)/2. We
solve the problem by breaking it into two cases, namely,
N = 2^n for an integer n ≥ 2, and an odd value of N. We
shall consider the former case first as the latter case has a
trivial solution.


3.1 N = 2^n, n ≥ 2


Benson and Jacoby [2] have generated all possible magic
squares of the 4th order (i.e., 4 × 4 matrices) using numbers
ranging from 1 to 16 exactly once. This was done in the
1970’s with the aid of the computing facilities at Dickinson
College, Carlisle, Pennsylvania [2]. Exactly 880 such magic
squares were generated by the computer. This verified the
result claimed by Frenicle, who published the 880 magic
squares in 1693.


These squares have been classified into twelve basic types,
depending on the relation between numbers of the same
row, column, forward diagonal or backward diagonal [2]. One
of the basic types is shown in Fig. 4. Arcs have been drawn
between elements whose sum is a constant. Also, each
column adds up to a constant. Of the square types, this one
is the most relevant to our problem of generating the
assignment matrix for conflict-free access. Compare Fig. 4
with Fig. 3c. Using this clue we were able to systematically
generate the skewed matrix for N = 2^n, n ≥ 2.


Fig. 5 shows the assignment matrices for N = 4, N = 8,
and N = 16. These matrices are divided into four quadrants,
of N/2 × N/2 cells each. We observe that some properties
of the assignment matrix help to simplify its construction.


Property 1: Exactly the numbers 1 through N are used as
valid entries.


Property 2: Each of the numbers occurs exactly once in any
row, column or main diagonal.


Property 3: Every two numbers joined by an arc in Fig. 4 add
up to N + 1. Thus all entries in any of the upper (lower)
quadrants can be generated if the entries in the corresponding
lower (upper) quadrants are known. That is,
$A[N + 1 - i][j] = N + 1 - A[i][j]$, where $1 \le i \le N/2$.


Property 4: Every column i ($1 \le i \le N/2$) in the left
quadrant (upper or lower) is an N-segment image permutation
of column (N + 1 - i) of the corresponding (upper or lower)
right quadrant.


It is therefore sufficient to generate any one quadrant of 
the N x N matrix. The remaining quadrants can be easily 
generated from this quadrant as shown by the above prop- 
erties. Algorithm 1 gives the construction of the N x N 
assignment matrix A. The algorithm is coded in the C 
programming language. 
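

As an illustration of what Properties 2 and 3 require, the following C sketch checks a candidate assignment matrix. It is not part of Algorithm 1; the function names, the row-major layout and the bound MAX_N are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 64  /* hypothetical upper bound on N for this sketch */

/* True if the n cells (row0 + t*drow, col0 + t*dcol), t = 0..n-1,
   of the row-major n x n matrix a hold each of 1..n exactly once. */
static bool line_is_permutation(const int *a, int n,
                                int row0, int col0, int drow, int dcol)
{
    bool seen[MAX_N + 1];
    memset(seen, 0, sizeof seen);
    for (int t = 0; t < n; t++) {
        int v = a[(row0 + t * drow) * n + (col0 + t * dcol)];
        if (v < 1 || v > n || seen[v])
            return false;
        seen[v] = true;
    }
    return true;
}

/* Property 2: every row, every column and both main diagonals
   contain each of the numbers 1..n exactly once. */
bool is_conflict_free(const int *a, int n)
{
    assert(n >= 1 && n <= MAX_N);
    for (int i = 0; i < n; i++) {
        if (!line_is_permutation(a, n, i, 0, 0, 1)) return false; /* row i */
        if (!line_is_permutation(a, n, 0, i, 1, 0)) return false; /* column i */
    }
    return line_is_permutation(a, n, 0, 0, 1, 1)        /* forward diagonal */
        && line_is_permutation(a, n, 0, n - 1, 1, -1);  /* back diagonal */
}

/* Property 3, 0-based: rows i and n-1-i add up to n+1 in every column. */
bool has_complement_rows(const int *a, int n)
{
    for (int i = 0; i < n / 2; i++)
        for (int j = 0; j < n; j++)
            if (a[i * n + j] + a[(n - 1 - i) * n + j] != n + 1)
                return false;
    return true;
}
```

For instance, the 4 x 4 matrix with rows (1 2 4 3), (3 4 2 1), (2 1 3 4), (4 3 1 2), chosen here for illustration, passes both checks, whereas a cyclic Latin square such as (1 2 3 4), (2 3 4 1), (3 4 1 2), (4 1 2 3) fails on the diagonals.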


Lemma 3.1.1 If a vector X of size $2^n$ is a permutation in
the range 1 to $2^n$, then the vector Y computed as

    $Y[i] = 2^{n+1} + 1 - X[i], \quad 1 \le i \le 2^n$

is a permutation in the range $2^n + 1$ to $2^{n+1}$.


Theorem 3.1.1 Algorithm 1 generates an N x N assignment
matrix such that each integer between 1 and N occurs exactly
once in every row, column, forward and backward diagonal.


Fig. 4 A Basic Type of a 4 x 4 Magic Square


Proof: We shall individually prove that every row, every
column and the forward and backward diagonals are
permutations. Each of these proofs is by induction.


We shall prove that the forward diagonal elements form a
permutation. The proof for the backward diagonal can be done
using similar reasoning. Fig. 6a shows the structure of the
assignment matrix for $N = 2^i$, and Fig. 6b for
$N = 2^{i+1}$. Tracing through the algorithm, we observe that
it maps the shaded areas of Fig. 6a to the corresponding
shaded areas in Fig. 6b.


Basis : N = 4. The assignment matrix given in Fig. 5a is
generated from the algorithm. The diagonal elements form a
permutation on numbers in the range 1 to 4 in the
$2^2 \times 2^2$ assignment matrix.


Hypothesis : The diagonal elements form a permutation on
numbers in the range 1 to $2^i$ in a $2^i \times 2^i$
assignment matrix (Fig. 6a).


Induction Step : Consider the blocks along the forward
diagonal of Fig. 6b. The diagonal elements of the shaded
blocks form a permutation in the range 1 to $2^i$ (induction
hypothesis). We shall now show that the values along the
forward diagonal of the unshaded blocks form a permutation in
the range $2^i + 1$ to $2^{i+1}$. Notice that the diagonal
elements of the unshaded blocks are generated from the
elements of the back diagonal of the matrix of Fig. 6a
according to the following equation:

    $A[j][k] = 2^{i+1} + 1 - A[2^i + 1 - j][k]$
    $\forall j, k \in$ diagonal elements of the unshaded blocks.


Since $A[2^i + 1 - j][k]$ lists the elements of the back
diagonal of the assignment matrix of dimension
$2^i \times 2^i$ and is a permutation in the range 1 to $2^i$,
the diagonal elements of the unshaded blocks must be a
permutation in the range $2^i + 1$ to $2^{i+1}$ (by Lemma
3.1.1). This completes the proof that the forward diagonal is
a permutation.


We shall now prove that every column of Fig. 6b forms a 
permutation. 


Basis : For N = 4, every column of the 4 x 4 matrix of
Fig. 5a is a permutation.


Hypothesis : The columns of the $2^i \times 2^i$ matrix shown
in Fig. 6a are permutations in the range 1 to $2^i$.


Induction Step : Consider a single column of blocks in
Fig. 6b. From the induction hypothesis the columns of the
shaded blocks form a permutation in the range 1 to $2^i$.
From Property 3 (Section 3) it is clear that each shaded block
generates one unshaded block according to

    $A[j][k] = 2^{i+1} + 1 - A[2^i + 1 - j][k]$
    $\forall j, k \in$ unshaded blocks.

Therefore the $A[j][k]$ are unique in the range $2^i + 1$ to
$2^{i+1}$ and by Lemma 3.1.1 form a permutation.


    Nby2 = N / 2;
    /* Generate the lower-right quadrant */
    for (i = 1; i < N; i += 2) {
        i1 = Nby2 + ((i - 1) / 2);
        A[i1][Nby2] = N - i;
        A[i1][Nby2 + 1] = N - i + 1; }
    /* sequence[] supplies the segment length i1 for each pair of
       columns; its contents are not given in this listing */
    for (i = 1; i < (Nby2 - 1) / 2 + 1; i++) {
        i1 = sequence[i - 1];
        x = Nby2 + (i - 1) * 2;
        y = Nby2 + (i * 2);
        for (a = 0; a < (Nby2 / i1); a++)
            for (b = 0; b < i1; b++) {
                A[Nby2 + (a + 1) * i1 - b - 1][y] =
                    A[Nby2 + (a * i1) + b][x];
                A[Nby2 + (a + 1) * i1 - b - 1][y + 1] =
                    A[Nby2 + (a * i1) + b][x + 1]; }
    }
    /* Generate the remaining quadrants */
    for (i = 0; i < Nby2; i++)
        for (i1 = 0; i1 < Nby2 / 2; i1++) {
            A[i1 + Nby2 / 2][i] = A[i1 + Nby2][i + Nby2];
            A[i1][i] = A[i1 + (3 * Nby2) / 2][i + Nby2]; }
    for (i = 0; i < Nby2; i++)
        for (i1 = 0; i1 < Nby2; i1++) {
            A[i1 + Nby2][i] = N + 1 - A[Nby2 - 1 - i1][i];
            A[i1][i + Nby2] = N + 1 - A[N - 1 - i1][i + Nby2]; }


Algorithm 1: Generation of Assignment Matrix


Finally, we shall prove that every row of the assignment
matrix A also forms a permutation.


Basis : By inspection of Fig. 5a, for N = 4, every row is a
permutation.


Hypothesis : The rows of a $2^i \times 2^i$ assignment matrix
(Fig. 6a) are permutations. We first show that the rows of the
lower right quadrant, when concatenated with the rows of the
upper left quadrant, form a permutation in the range 1 to
$2^i$ (Fig. 6a). The reasoning is as follows. By Property 4
(Section 3), the lower left quadrant is generated from the
lower right quadrant by performing a segment image
permutation on the columns of the lower right quadrant. This
transformation maps row j of the lower right quadrant to row
(N/2 + 1 - j) of the lower left quadrant ($1 \le j \le N/2$).
Thus the numbers in row (N/2 + 1 - j) of the lower left
quadrant are identical (though not in the same order) with
those in row j of the lower right quadrant. The numbers in
row j of the upper left quadrant are obtained from row
(N/2 + 1 - j) of the lower left quadrant by Property 3.
Therefore, row j of the upper left quadrant cannot have
numbers in common with the numbers in row (N/2 + 1 - j) of the
lower left quadrant, and hence they also differ from the
numbers in row j of the lower right quadrant. Furthermore,
row j of the upper left quadrant is a permutation. This proves
that the rows of the lower right quadrant, when abutted with
the rows of the upper left quadrant, form a permutation in the
range 1 to $2^i$. We will use this result in the induction
step to show that the rows of Fig. 6b are indeed permutations.


Induction Step : We have proved in the hypothesis step that
the rows of the upper left quadrant and the lower right
quadrant of Fig. 6a, when abutted, form a permutation. Thus,
the shaded top row of Fig. 6b is a permutation. By Lemma
3.1.1, the bottom unshaded row in Fig. 6b is a permutation in
the range $2^i + 1$ to $2^{i+1}$. Thus the whole of the bottom
row, the shaded as well as the unshaded part, forms a
permutation. Using a similar argument, every row of the
assignment matrix can be shown to be a permutation.


Hence every row, every column and the forward and backward
diagonals of the assignment matrix A generated by Algorithm 1
are permutations. ∎


Algorithm 1 has a complexity of $O(N^2)$, as each cell of the
matrix A is visited exactly once and there are $N^2$ cells.
The assignment matrix is superimposed on the data matrix to
obtain the memory module assignment for each element of the
data matrix for conflict-free access to rows, columns and
main diagonals. That is, if the number corresponding to
$a_{ij}$ is m in the assignment matrix, then element $a_{ij}$
is stored in memory module m.


procedure generate-quadrant(N)
{
    y = 1; i = 0; j = (N - 1) / 2;
    A[i][j] = y;
    while (there is an empty cell) {
        y = (y)mod(N) + 1;
        if (A[(i - 1)mod(N)][(j + 1)mod(N)] is empty) {
            i = (i - 1)mod(N);
            j = (j + 1)mod(N);
            A[i][j] = y;
        }
        else {
            i = (i + 1)mod(N);
            A[i][j] = y;
        }
    }
}


Algorithm 2: Procedure to generate assignment matrix for
odd N




3.2 Odd Values of N 


In this case, where N is odd, the solution of the magic
square problem is trivial. Let the cells of the N x N matrix
be labelled $a_{ij}$, where $0 \le i, j \le N - 1$. Algorithm
2 generates the desired assignment matrix. The proof that
Algorithm 2 generates a permutation can be found in [2]. An
example of an assignment matrix for N = 7 is given in Fig. 7.
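

The procedure of Algorithm 2 can be sketched in C as follows. This is one reading of the listing, not a definitive transcription: the starting column (N - 1)/2, the row-major layout, the use of 0 for an empty cell, the bound MAX_N and both function names are assumptions of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 31  /* hypothetical bound for this sketch */

/* Fills the n x n matrix a (row-major, 0 = empty): start in the middle
   of the top row, move up and to the right with wraparound, and step
   one row down instead when the target cell is already taken. */
void generate_odd(int *a, int n)
{
    assert(n % 2 == 1 && n >= 1 && n <= MAX_N);
    memset(a, 0, (size_t)n * n * sizeof *a);
    int y = 1, i = 0, j = (n - 1) / 2;   /* starting column is assumed */
    a[i * n + j] = y;
    for (int filled = 1; filled < n * n; filled++) {
        y = y % n + 1;                   /* values cycle through 1..n */
        int up = (i + n - 1) % n;        /* (i - 1) mod n, kept non-negative */
        int right = (j + 1) % n;
        if (a[up * n + right] == 0) {
            i = up;
            j = right;
        } else {
            i = (i + 1) % n;
        }
        a[i * n + j] = y;
    }
}

/* True if every row, every column and both main diagonals of a
   contain each of 1..n exactly once. */
bool all_lines_are_permutations(const int *a, int n)
{
    assert(n >= 1 && n <= MAX_N);
    for (int line = 0; line < 2 * n + 2; line++) {
        bool seen[MAX_N + 1] = { false };
        for (int t = 0; t < n; t++) {
            int r, c;
            if (line < n)           { r = line; c = t; }         /* row */
            else if (line < 2 * n)  { r = t; c = line - n; }     /* column */
            else if (line == 2 * n) { r = t; c = t; }            /* forward diagonal */
            else                    { r = t; c = n - 1 - t; }    /* back diagonal */
            int v = a[r * n + c];
            if (v < 1 || v > n || seen[v])
                return false;
            seen[v] = true;
        }
    }
    return true;
}
```

Under this reading the generated matrix passes the row, column and two-diagonal check for N = 5 and N = 7. For N = 3 the forward diagonal repeats a value, so the sketch is exercised here only on those two sizes.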


4 Conclusion 


In this paper we have presented an algorithmic solution to
the problem of aligning data for conflict-free access to rows,
columns and the main diagonals. The method presented in this
paper for generating module numbers can be used in the table
lookup technique as employed in GF11 [1]. However, it is not
practical for large matrices. A skew function therefore needs
to be extracted from the information in the assignment matrix
so that, given the indices, the function assigns the memory
module. It appears that this function is nonlinear. An
alignment interconnection network that will implement the
permutation defined by the skew function can then be
synthesized. The feasibility of modifying the assignment
matrix to accommodate conflict-free access to broken diagonals
and $N^{1/2} \times N^{1/2}$ sub-matrices also needs to be
studied. Furthermore, the possibility of N taking any even
value, not necessarily a power of 2, needs to be considered.


Fig. 5 Assignment Matrices


ABSTRACT


During the execution of a program on a parallel machine,
run-time overhead is incurred by activities such as
scheduling, interprocessor communication and synchronization.
This overhead is added to the execution time in the form of
processor latencies and busy waits. As overhead increases, the
amount of parallelism that can be exploited decreases. We
consider two models of run-time overhead. In the first model,
overhead increases linearly with the number of processors
assigned to a parallel task. In the second, overhead is
logarithmic in the number of processors. We discuss ways of
computing the optimal or close to optimal number of processors
for each case, as well as the critical task size.


1. Introduction


The overhead involved with the simultaneous application of
many processors to the same task can be very significant
[PoKu87], [Poly86], [Rein85], [Cytr85]. So far most of the
existing parallel processor systems have not addressed the
overhead issue adequately, nor have they taken it into account
either in the compiler or in the hardware. On the Cray X-MP,
for example, microtasking can be applied at any level,
although it has been shown that below a given degree of
granularity microtasking results in a slowdown [Cray85].

In this paper we analyze two widely used models of 
overhead and their impact on the degree of parallelism that 
we can exploit. Using these models we can compute an 
approximation to the optimal number of processors for a 
given parallel task. This is also equivalent to computing the 
minimum size of an allocatable task. With these models we 
then perform some measurements using simple parallel 
loops. Finally we discuss ways of computing approximate 
execution times of tasks at compile time. 


2. Overhead of Parallel Tasks 


As our machine model we choose a p-processor shared memory
or message passing system with homogeneous processors. If
$T_1$ and $T_p$ are the serial and parallel execution times
(on p processors) for a program PROG respectively, then we
define the speedup $S_p$ of PROG on a p-processor system to be
$S_p = T_1 / T_p$. The efficiency of execution of PROG is then
defined by $E_p = S_p / p$ [Kuck78]. It is clear that for each
program and for each system $1 \le S_p \le p$ and
$0 < E_p \le 1$ [Bane79]. It is usually hard to precisely
compute $T_1$ and $T_p$ at compile-time. However, close
approximations are adequate in estimating overhead and
performing related optimizations.


A program is composed of a collection of tasks, where tasks
can be serial or parallel. Any pair of tasks can be data
dependent on, or independent of, each other. The tasks and the
dependence relationships among tasks define the task graph for
a given program. Parallel tasks are composed of a set of
independent processes. Processes are serial entities, i.e.,
they always execute on a single processor. We assume that the
execution of a process is nonpreemptive.


We will consider the worst-case overhead incurred with the
parallel execution of processes. This is the familiar
fork/join operation which is employed, for instance, in
generating several processes from a parallel DO loop. Such
parallel loops can be specified by the programmer or can be
the result of program restructuring. A particular type of
parallel loop used in this paper is the DOALL loop [KLPL81].
The iterations of a DOALL loop are data independent and
therefore can be assigned to different processors and can be
executed in any order. A DOALL loop defines a task; one or
more iterations executing on the same processor define a
process. Parallelism at the task level can be utilized by
executing different tasks simultaneously. This is also known
as high level spreading.


The question of interest to us is the estimation of the
critical process size or CPS. Informally, the CPS can be
defined as the minimum size of a process for which the
execution time on a single processor is equal to the
associated overhead. When a parallel task is distributed to
several processors at run-time, it incurs a penalty or
overhead that limits the degree of exploitable task
granularity. Consider the parallel execution of a DOALL loop
whose iterations are spread across processors at run-time.
Run-time overhead may include several activities that do not
occur during serial execution. All processors involved, for
example, will have to access the ready-task queue in a serial
mode since it is a critical section. Different processors will
get different iterations of the same loop. At the end of the
loop all processors involved must pass through a barrier to
determine that the loop has been executed and that they are
allowed to proceed with the next task [PoKu87]. The fetching
of instructions at run-time can also be considered part of the
overhead. Especially with self-scheduling, instruction
prefetching cannot work since, by definition, it is impossible
to predict which processor will execute the next task or the
next iteration of a loop. All these activities prolong the
parallel execution time of a program. None of the above occurs
during serial execution. This overhead, as would be expected,
makes it inefficient to execute small tasks in parallel or to
use a very large number of processors on even large parallel
tasks. If the task is not large enough to amortize the
overhead, we may end up with a parallel execution time which
is larger than the serial execution time.


The tasks involved in an instance of high level spreading
can be thought of as iterations of a DOALL loop whose
loop-body contains conditional statements, and therefore
different iterations have different execution times. Therefore
high level spreading can be reduced to the parallel loop case
where the number of iterations equals the number of
independent tasks in that set. Since it is impossible to
precisely estimate the execution time of a loop body with
conditional statements, either at compile-time or at run-time,
we assume an average or a worst case value as discussed in
Section 5. For the moment let us assume that the loop-body for
a given parallel loop has a constant execution time.


Consider a DOALL loop with N iterations whose body execution
time is B units, and which is to be executed on a system with
p processors. Let us see how one can compute an approximation
to the CPS, i.e., the minimum number of iterations (chunk)
allocated to each idle processor. Each time a processor
dispatches one or more iterations of the loop it incurs an
overhead o. The question is to determine the minimum chunk
size k for which $S_p > 1$, or equivalently,

    $S_p = \frac{NB}{\frac{N/k}{p}\,(kB + o)} > 1.$

After simplification we get

    $k > \frac{o}{B(p - 1)}.$

As one would expect, the chunk size is inversely proportional
to the number of processors executing the loop. For example,
if o = B and p = 2 then at least k = 2 iterations should be
allocated each time. In what follows we concentrate on
determining the optimal number of processors that should be
allocated to a given parallel loop.
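

The chunk-size bound above can be evaluated directly. The following C sketch returns the smallest integer k that satisfies k > o / (B(p - 1)); the function name and the use of integer time units are assumptions made for this example.

```c
#include <assert.h>

/* Smallest integer chunk size k with k > o / (B * (p - 1)), i.e. the
   smallest chunk for which the approximate speedup exceeds 1.
   overhead (o) and body_time (B) are expressed in the same integer
   time units, which is an assumption of this sketch. */
long min_chunk_size(long overhead, long body_time, long processors)
{
    assert(overhead >= 0 && body_time > 0 && processors >= 2);
    long denom = body_time * (processors - 1);
    /* floor(o / denom) + 1 is the least integer strictly above o / denom */
    return overhead / denom + 1;
}
```

With o = B and p = 2 the function returns 2, which agrees with the example in the text.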


3. Two Run-Time Overhead Models


To analyze the run-time overhead we use two conjectures
that have been backed by empirical results [Cytr85],
[LeKK86], [Ston87]. The first conjecture states that during
the parallel execution of a task the run-time overhead is
linearly proportional to the number of processors involved.
The second conjecture states that the run-time overhead is
logarithmically proportional to the number of processors.
Let us consider two examples where these two conjectures are
valid.


Consider the execution of a DOALL loop on a set of p
processors connected to a common bus. If the iterations of
this DOALL are spread among the p processors, then all p
processors must execute a join operation before they are
allowed to proceed with the next task. If two lexically
adjacent DOALLs $L_1$ and $L_2$ operate on the same array, it
will be necessary (in general) to execute a barrier
synchronization (join) between $L_1$ and $L_2$. Thus all
processors executing $L_1$ must finish before they start on
$L_2$. Clearly the execution of a barrier operation on a
bus-based multiprocessor involves O(p) steps in the worst case
[Ston87]. In a dynamic scheduling environment this overhead
will also occur during dispatching of iterations, assuming all
processors start on a loop at the same time.


If the same example is used for p processors interconnected
in a tree structure, the barrier operation will take
O(log p) steps to complete. A more real-world example of
logarithmic run-time overhead is provided by shared memory
multiprocessors such as the Cedar and the Ultracomputer, which
employ multistage interconnection networks. If no special
hardware is used and if synchronization is done through the
shared memory, then the logarithmic overhead case applies here
as well. The results presented in this section can be used by
the compiler to draw exact or approximate conclusions for each
task in a program, and can be used at run-time to avoid
inefficient processor allocations.


3.1. Run-time Overhead is O(p) 


As mentioned above we can identify a parallel task with a
DOALL loop without loss of generality. Let $T_1$ and $T_p$
denote, as usual, the serial and parallel execution time of a
given task. Let N be the number of iterations of a DOALL loop
and B the execution time of the loop-body. If the loop-body
has a varying execution time, the procedure of Section 5 can
be used to derive a worst case or average value for B.


In this section we consider the case where the run-time
overhead is linearly proportional to the number of processors
assigned to a parallel loop. Let $o_0$ be the run-time
overhead constant, which in general depends on the
characteristics of the code and the machine architecture. The
compiler can supply the value of $o_0$ for each loop (parallel
task) in the program. The serial execution time of a loop with
N iterations and a loop-body execution time of B would be
$T_1 = NB$. The parallel execution time on p processors would
then be

    $T_p = \lceil N/p \rceil B + o_0 p.$    (1)

Consider (1) as a function of p. If overhead were zero, (1)
would be an integer-valued decreasing function. Since (1) is
not continuous it is not amenable to analytical study. We can
approximate the function in (1) by a continuous function, by
eliminating the ceiling. We thus get

    $T(p) = NB/p + o_0 p.$    (2)

T(p) is a continuous real function in the interval
$(0, +\infty)$, with continuous first and second derivatives.
Therefore we can study its shape and determine the point where
overhead becomes minimal. In other words we want to find the
value of p for which (1) becomes minimum and therefore the
speedup of that task is maximized. The minimum value is given
by the following theorem.


Theorem 1. T(p) in (2) is minimized when the task is
executed on a number of processors given by

    $p_0 = \sqrt{NB / o_0}.$    (3)


Proof: First we show how (3) is derived and then prove that
it is indeed the optimal value for that task (loop). Consider
(2), which is an approximation to the parallel execution time
defined by (1). T(p) has a first derivative

    $T'(p) = \frac{dT(p)}{dp} = -\frac{NB}{p^2} + o_0.$    (4)

The local extreme points of (2) are at the roots of its first
derivative, that is, at

    $p_{0,1} = \pm\sqrt{NB / o_0}$    (5)

and since we are only interested in values in the interval
$(0, +\infty)$, we discard the negative root $p_1$. The second
derivative of T(p) is

    $T''(p) = \frac{d^2 T(p)}{dp^2} = \frac{2NB}{p^3}.$    (6)

$T''(p)$ is always greater than zero for p > 0 and therefore
the extreme at $(p_0, T(p_0))$ is a minimum, where $p_0$ is
given by (5). If $p_0$ is an integer that divides N, then the
parallel execution time $T_p$ is also minimized and it is
given by

    $T_p = \frac{NB}{\sqrt{NB/o_0}} + o_0\sqrt{NB/o_0}
         = \sqrt{NB o_0} + \sqrt{NB o_0} = 2\sqrt{NB o_0}.$    (7)

Indeed if $\hat{T}_p$ is the parallel execution time for any
other p, then p can be expressed as $p = c\sqrt{NB/o_0}$ where
c is a positive rational number. Then $T_p \le \hat{T}_p$, or
equivalently,

    $2\sqrt{NB o_0} \le \sqrt{(NB)^2 o_0 / (c^2 NB)}
                       + \sqrt{c^2 o_0^2 (NB) / o_0}$    (8)

and if we substitute $x = NB o_0$ in (8) we have

    $2\sqrt{x} \le \sqrt{x / c^2} + \sqrt{c^2 x}
     \;\Rightarrow\; 0 \le x(1 + c^4 - 2c^2)$

and since x > 0, we get $(1 - c^2)^2 \ge 0$, which is always
true.


Therefore $p_0$ is the optimal value for T(p) and in certain
cases the optimal value for $T_p$. ∎
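

As a hedged illustration of how the continuous optimum of Theorem 1 relates to the exact expression (1), the C sketch below scans every integer p from 1 to N and returns the smallest p that minimizes (1). The function names and the use of integer time units are assumptions of this example; the scan is a brute-force check, not a procedure proposed in the text.

```c
#include <assert.h>

/* Exact parallel time from (1): ceil(N / p) * B + overhead * p,
   with all quantities in the same integer time units (assumed). */
long exact_time_linear(long n, long b, long overhead, long p)
{
    assert(n > 0 && b > 0 && overhead > 0 && p > 0);
    long iterations_per_processor = (n + p - 1) / p;   /* ceil(n / p) */
    return iterations_per_processor * b + overhead * p;
}

/* Scans p = 1..n and returns the smallest p that minimizes (1). */
long best_processors_linear(long n, long b, long overhead)
{
    long best_p = 1;
    long best_t = exact_time_linear(n, b, overhead, 1);
    for (long p = 2; p <= n; p++) {
        long t = exact_time_linear(n, b, overhead, p);
        if (t < best_t) {       /* strict: keep the first minimizer */
            best_t = t;
            best_p = p;
        }
    }
    return best_p;
}
```

For N = 150, B = 8 and an assumed overhead constant of 4, the scan returns p = 15 with an exact time of 140, while (3) gives a continuous optimum of about 17.3 and (7) a time of about 138.6. For N = 100, B = 4 and the same constant, the continuous optimum 10 divides N and the scan returns exactly 10 with time 80.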


Corollary 1. For $o_0 > (NB)/4$ the approximation function
T(p) defined in (2) satisfies

    $T(p) \ge NB$    (9)

for any integer p > 0.

Proof: By substituting T(p) from (2) in (9) we have

    $\frac{NB}{p} + o_0 p \ge NB$  or  $o_0 p^2 - p(NB) + NB \ge 0.$    (10)

(10) is a quadratic inequality in p and since $o_0 > 0$, the
inequality in (10) is always true if the discriminant D of
the quadratic in (10) is negative, i.e.,

    $D = (NB)^2 - 4 o_0 (NB) < 0$, which gives us $o_0 > \frac{NB}{4}$. ∎


Corollary 2. If $o_0 > (NB)/k$ then the parallel execution
time for p > k is greater than the serial execution time,
i.e., $T_p > T_1$.


3.2. Run-Time Overhead is O(log p)


Let us assume that the run-time overhead is logarithmically
proportional to the number of processors assigned to a
parallel task. Therefore, in this case the parallel execution
time is given by

    $T_p = \lceil N/p \rceil B + o_0 \log p.$    (11)

To determine the optimal number of processors that can be
assigned to a parallel task, we follow the same approach as in
the previous case. Again, since (11) is not a continuous
function we approximate it with

    $T(p) = \frac{NB}{p} + o_0 \log p$    (12)

which is continuous in $(0, +\infty)$, with continuous first
and second derivatives. The corresponding theorem follows.


Theorem 2. The approximate parallel execution time defined
by (12) is minimized when

    $p_0 = \frac{NB}{o_0}.$


Proof: The first derivative of (12) is given by

    $T'(p) = \frac{dT(p)}{dp} = -\frac{NB}{p^2} + \frac{o_0}{p}.$    (13)

T(p) has an extreme point at

    $p_0 = \frac{NB}{o_0}.$    (14)

The second derivative of (12) is

    $T''(p) = \frac{2NB - o_0 p}{p^3}$  and  $T''(NB/o_0) = \frac{o_0^3}{(NB)^2} > 0.$

Therefore T(p) has a minimum at $p = p_0$. ∎


However $p_0$ is not necessarily a minimum point for (11).
We can compute an approximation to the optimal number of
processors for (11) as follows. Let
$\epsilon = \lceil p_0 \rceil - p_0$, where
$0 \le \epsilon < 1$. Then the number of processors
$p_\epsilon$ that "minimizes" the parallel execution time
$T_p$ in (11) is given by

    $p_\epsilon = \lceil p_0 \rceil$  if $\epsilon \le 0.5$
    $p_\epsilon = \lfloor p_0 \rfloor$  if $\epsilon > 0.5$    (15)

where $p_0 = NB/o_0$. In the next section we see that (15) is
a very close approximation to the optimal number of processors
for (11). The overhead problem was studied in a similar
context in [Cytr85].
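

The rounding rule can be sketched in C as follows, reading (15) as rounding the continuous optimum NB divided by the overhead constant to the nearest integer. That reading, the function name, the integer time units and the final clamp to at least one processor are assumptions of this example.

```c
#include <assert.h>

/* Rounds p0 = (N * B) / overhead to the nearest integer number of
   processors. With eps = ceil(p0) - p0, the result is ceil(p0) when
   eps <= 0.5 and floor(p0) otherwise. Only integer arithmetic is used. */
long processors_log_overhead(long n, long b, long overhead)
{
    assert(n > 0 && b > 0 && overhead > 0);
    long work = n * b;                 /* N * B */
    long p_floor = work / overhead;    /* floor(p0) */
    long rem = work % overhead;        /* p0 = p_floor + rem / overhead */
    long p = p_floor;
    /* for rem != 0, eps = (overhead - rem) / overhead, so
       eps <= 0.5  <=>  2 * (overhead - rem) <= overhead */
    if (rem != 0 && 2 * (overhead - rem) <= overhead)
        p = p_floor + 1;               /* ceil(p0) */
    return p < 1 ? 1 : p;              /* clamp added in this sketch */
}
```

For example, NB = 29 with an overhead constant of 4 gives a continuous optimum of 7.25 and a result of 7, while NB = 31 gives 7.75 and a result of 8.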


4. Measurements 


We can use the above models to derive an approximate 
estimate of the effect of run-time overhead on the degree of 
usable parallelism, and thus on execution time. We used (1) 
to compute the actual execution time of a parallel task, and 
(2) to compute its approximation function for the linear 
overhead case. Similarly (11) and (12) were used for the log- 
arithmic overhead case. 


Figure 1 illustrates the execution time versus the number
of processors for a DOALL with N = 150 and B = 8 under (a)
linear overhead and (b) logarithmic overhead. Figures 2, 3,
and 4 illustrate the same data for three different DOALLs,
whose N and B values are shown in each figure. The solid lines
plot the values of $T_p$, the actual parallel execution time.
Dashed lines give the approximate execution times T(p). For
these measurements a value of $o_0 = 4$ was used. The overhead
constant, although optimistically low, is not unrealistically
small for (hypothetical) systems with fast synchronization
hardware. In all cases we observe that as long as
$p \le N$ (which is the case of interest), the difference
between the values of the approximation function T(p) and the
actual parallel execution time $T_p$ is negligible.


Looking at Figures 1a and 3a we observe that when the loop
body is small, the associated overhead severely limits the
number of processors that can be used on that loop. For these
two cases, for example, only 1/10 and 1/40 of the ideal
speedup can be achieved. When B is large, however, the
overhead has a less negative impact on performance. For the
case of Figure 2a, for instance, 1/2 of the maximum speedup
can be obtained in the presence of linear overhead. The same
is true for Figure 4a. In all cases the logarithmic overhead
had significantly less negative impact on speedup.


5. Deciding the Minimum Unit of Allocation 


Estimating the projected execution time of a piece of code
(on a single processor) can be done by the compiler or the
run-time system with the same precision. Let us take for
example the case of a DOALL loop without conditional
statements. All that needs to be done is to estimate the
execution time of the loop body, B. For our purpose, the exact
number of loop iterations need not be known at compile-time.
Since we know the overhead for the particular machine and the
structure of a particular loop, we can find the critical block
size for that DOALL, that is, the minimum number X of
iterations that can be allocated to a single processor such
that $S_p > 1$. This number X can be "attached" to that DOALL
loop as an attribute at compile-time. During execution the
run-time system must assign to an idle processor X or more
iterations of that loop (but no fewer). In case X > N the loop
is treated as serial.


Let us consider the code inside a DOALL loop. The
control-flow graph of a code module with conditional
statements can be uniquely represented by a directed graph.
Consider for example the code module of Figure 5, which
constitutes the loop body of some DOALL. The corresponding
control-flow graph is shown in Figure 6. Since there is no
hope of accurately estimating the execution time either in the
compiler or at run-time, we choose to follow a conservative
path. The execution time of each basic block $B_i$ can be
estimated quite precisely. We take the execution time of the
loop body to be equal to the execution time of the shortest
path in the tree.


    dat B,
    if Ci then By
    else Bg
    Bo
    if C, then goto 1
    else if C, then By
    else Be
    exit
    Bg
    if C, then Be
    else By


Figure 5. An example of conditional code.


Figure 6. The control flow tree of Figure 5.


The shortest path can be found by starting from the root of
the tree and proceeding downwards, labeling the nodes as
follows. Let $t_i$ be the execution time of node $v_i$, and
$l_i$ be its label. The root $v_1$ is labeled $l_1 = t_1$.
Then a node $v_i$ with parent node $v_j$ is labeled with
$l_i = l_j + t_i$. As we proceed we mark the node with the
minimum (so far) label. In case we reach a node that has
already been labeled (cycle) we ignore it. Otherwise we
proceed until we reach the leaves of the tree. Note that the
labeling process does not have to be completed: if at some
point during the labeling process the node that has the
minimum label happens to be a leaf, the labeling process
terminates. The path $\pi$ that consists of the marked nodes
is the shortest execution path in that code. The number of
iterations required (conservatively) to form the critical size
is a function of the number of processors, as shown in Section
3. B, the execution time of $\pi$, is given by the label of
the last node of path $\pi$. A less conservative approach
would be to take the average path length, assuming all
branches in the code are equally probable. In the example of
Figure 6 the above procedures give us B = 12 and B = 33.33
respectively.
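

The two estimates of B can be sketched in C on a small tree. This is a simplified illustration: it visits the whole tree recursively instead of stopping early as the labeling procedure does, it reads "average path length" as the expected time when every branch out of a node is equally likely, and the node layout, the names and the limit MAX_CHILDREN are hypothetical.

```c
#include <assert.h>

#define MAX_CHILDREN 4   /* hypothetical limit for this sketch */

struct node {
    double time;                    /* execution time of this basic block */
    int    child_count;             /* 0 for a leaf */
    int    children[MAX_CHILDREN];  /* indices into the node array */
};

/* Conservative estimate: time of the shortest root-to-leaf path
   starting at node v. */
double shortest_path_time(const struct node *tree, int v)
{
    const struct node *nd = &tree[v];
    assert(nd->child_count >= 0 && nd->child_count <= MAX_CHILDREN);
    if (nd->child_count == 0)
        return nd->time;
    double best = shortest_path_time(tree, nd->children[0]);
    for (int c = 1; c < nd->child_count; c++) {
        double t = shortest_path_time(tree, nd->children[c]);
        if (t < best)
            best = t;
    }
    return nd->time + best;
}

/* Less conservative estimate: expected time when each branch out of
   a node is taken with equal probability. */
double expected_path_time(const struct node *tree, int v)
{
    const struct node *nd = &tree[v];
    assert(nd->child_count >= 0 && nd->child_count <= MAX_CHILDREN);
    if (nd->child_count == 0)
        return nd->time;
    double sum = 0.0;
    for (int c = 0; c < nd->child_count; c++)
        sum += expected_path_time(tree, nd->children[c]);
    return nd->time + sum / nd->child_count;
}
```

For a made-up tree whose root costs 2 and branches either to a leaf costing 3 or to a block costing 1 that in turn branches to leaves costing 4 and 10, the conservative estimate is 5 and the equal-probability estimate is 7.5. These numbers are unrelated to Figure 6.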


6. Conclusions 


Run-time overhead is an important issue for parallel 
processor machines. Even moderately low run-time over- 
head can significantly limit the amount of program parallel- 
ism that can be exploited. In this paper we analyzed two 
models of run-time overhead and we computed the optimal 
number of processors that can be used for each case. The 
measurements indicated that the approximations used model 
closely the exact formulation of the problem. 


Acknowledgements: The author would like to thank the


Abstract


An MIMD architecture dubbed Microflow is presented 
which combines very low cost communication and syn- 
chronization with the latency avoidance techniques of 
uniprocessor architectures. The communication and syn- 
chronization is implemented with extremely fast message 
passing by having targets of messages be general purpose 
registers. Communication between adjacent nodes can be 
accomplished in the time it takes to execute one instruc- 
tion. 


A Microflow processor contains multiple windows, each 
containing a context. This mechanism enables high per- 
formance servers to be constructed in software while ena- 
bling the server to have high priority and low overhead. 


The message passing elements integrate smoothly with RISC
or even moderately horizontal instruction sets, enabling
Microflow to perform well even on those parts of the code
which do not parallelize well.


1. Introduction 


In this paper, an MIMD architecture dubbed Microflow 
is presented which combines very low cost communication 
and synchronization with the latency avoidance techniques of 
uniprocessor architectures. The name Microflow is derived 
from extremely fine-grained message passing which extends 
functional unit style data- and control- flow synchronization 
and communication across processors. The peak communica- 
tions performance per processor in Millions of Transmissions 
Per Second (MTPS) is equal to the processor MIPS rate. 


Assume a switch transition rate of four times the processor
instruction rate. This message transmission rate means that:


• Neighboring processors can communicate and synchronize
  in a few instruction cycles.


• Monitors (and other forms of remote procedure calls) can
  be invoked in a few instruction cycles.


And, on a computer with a thousand processors:


• Most distant processors can communicate and synchronize
  in 4 machine instructions.


• Computer-wide synchronization can be achieved in tens of
  instructions.


• Summing, enumeration, and-trees and or-trees can also be
  achieved in tens of instructions.


The speed at which a parallel processor performs
communications and synchronization is an important metric of
performance. However, when the degree of parallelism is larger
than the number of processors, the computation can be broken
into parallel chunks, each of which is most efficiently run
sequentially. Also, when the degree of parallelism is smaller
than the number of processors, it is the speed of the
uniprocessor which will increasingly determine computation
performance. Unlike other fine-grained architectures that we
are aware of, Microflow performs serial code at uniprocessor
speeds. Hence, all of the traditional latency avoidance
speedup techniques are applicable, including memory hierarchy,
prefetching, pipelining, word parallel arithmetic and so
forth. In addition, compiler optimizations enable each
processor to execute only those instructions which are
necessary for its part of the computation.


The Microflow design conservatively extends the techniques 
for high-performance uniprocessors by integrating 
switching hardware and augmenting processor design with 
message passing. Although the hardware extensions are con- 
servative, both new programming language constructs and 
new compilation techniques are necessary to fully exploit the 
Microflow Architecture [Sol87]. 


Our benchmark computation is to efficiently operate on 
pointer-based data structures, although Microflow is also very 
effective at both systolic processing and coarse grain parallel 
processing. Pointer-based computations arise in symbolic 
processing as in compiler optimization, computer-aided 
design, data bases, and artificial intelligence applications. 


2. Design 


We shall define an integrated architecture as one which 
combines both message passing and shared memory. Logi- 
cally, message passing and shared memory are equivalent in 
the sense that either technique can be used to simulate the 
other. Integrated architectures have advantages over either 
shared memory or message passing only if the hardware per- 
formance of each is significantly better than simulation in 
software. We describe how these performance advantages 
are attained in Microflow. 


2.1. Implementation of integrated architecture 


In the succeeding sections, the implementations of loads 
(shared memory) and of sends (message passing) are 
described. All large scale parallel processors have an 
interconnection network, as shown in Figure 1. If the computer is 
a message passing system, then the interconnection network 
routes network messages from processors to nodes; if it is a 
shared memory system, then the interconnection network 
routes network messages between the memory modules and 
the processors. Hence, in either case it is network messages 
which are routed on the network. 


Interconnection Network 


Figure 1 — General large scale parallel processor scheme 


2.1.1. Implementation of load 


Loads are implemented in Microflow the same way as 
on other high performance processors. Associated with each 
register is a Full/Empty bit. Registers with valid contents are 
marked Full; Empty denotes that the register is reserved for 
the result of an outstanding load. 


A load performs the following functions: 


(1) The register is marked as Empty. 

(2) A network message is sent off-chip and routed to 
the addressed memory module. 

(3) The memory module fetches the value and replies 
with a return network message which is routed to 
the originating processor. 

(4) The value is put in the register and the register is 
marked as Full. 
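
The four steps above can be sketched as a small simulation. This is an illustrative model only, not the Microflow hardware: the class names, the fixed `latency` parameter, and the simplification of reading memory at issue time are all assumptions made for the example. It shows that two loads with different target registers overlap, so the processor waits for one network round trip rather than two.

```python
from dataclasses import dataclass, field


@dataclass
class Register:
    value: int = 0
    full: bool = True  # Full = valid contents; Empty = awaiting a load


@dataclass
class Processor:
    regs: list = field(default_factory=lambda: [Register() for _ in range(4)])
    in_flight: list = field(default_factory=list)  # [cycles_left, reg, value]

    def load(self, reg, memory, addr, latency):
        """Steps (1)-(2): mark the target Empty and send the request.

        Simplification: the value is read from memory at issue time and
        merely delayed by `latency` cycles.
        """
        self.regs[reg].full = False
        self.in_flight.append([latency, reg, memory[addr]])

    def tick(self):
        """Steps (3)-(4): deliver replies whose latency has elapsed."""
        for msg in self.in_flight:
            msg[0] -= 1
        arrived = [m for m in self.in_flight if m[0] <= 0]
        for msg in arrived:
            _, reg, value = msg
            self.regs[reg].value = value
            self.regs[reg].full = True
            self.in_flight.remove(msg)

    def read(self, reg):
        """Return the value, or None if the instruction would stall."""
        r = self.regs[reg]
        return r.value if r.full else None


def run_example():
    memory = {100: 7, 200: 35}
    p = Processor()
    p.load(0, memory, 100, latency=3)  # both loads are outstanding at once
    p.load(1, memory, 200, latency=3)
    stalls = 0
    while p.read(0) is None or p.read(1) is None:
        p.tick()
        stalls += 1
    return p.read(0) + p.read(1), stalls
```

With the hypothetical latency of 3 cycles, `run_example()` stalls for 3 cycles in total rather than 6, because the second load is issued while the first is still in the network.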


As in other high-performance computers, the processor 
continues to issue instructions until an instruction is encountered 
which requires a register which is marked Empty. Two 
advantages are obtained: 

• Multiple loads with different target registers can 
execute simultaneously. 


• The use of both control flow pipelining (instruction 
counter) and data flow (returning load values) is 
more efficient than either mechanism alone.¹ 


2.1.2. Implementation of send 


Before we discuss the architectural implementation, 
message passing is presented on a more abstract level. The 
semantics are that proc_i sends a value to a communications 
variable (cv) at node_j. As with registers, communications 
variables are marked with Full/Empty. The sequence for a 
send is: 


(1) The originator, proc_i, sends a network message to cv 
at node_j. 


(2) The processor at node_j, proc_j, issues a receive 
instruction which marks the receiving cv as Empty. 


(3) When the message is at node_j, and the cv to which 
it is destined is marked Empty, the value is loaded 
into the cv and the cv is marked Full. 


Note that since the originating and destination proces- 
sors operate asynchronously, steps 1 and 2 may be inter- 
changed. Nevertheless, the Full/Empty interlock ensures 
correct operation. 
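
The interlock can be illustrated with a minimal sketch. The class and method names are hypothetical, and holding an early message in a small buffer until the receive is issued is an assumption of this example rather than a description of the hardware. The point is only that the outcome is the same whichever of steps 1 and 2 happens first.

```python
from collections import deque


class CommVar:
    """A communications variable with a Full/Empty bit (illustrative)."""

    def __init__(self):
        self.full = True        # not yet reserved by a receive
        self.value = None
        self.arrived = deque()  # messages that reached the node early

    def receive(self):
        """Step (2): mark the cv Empty, i.e. ready to accept a value."""
        self.full = False
        self._deliver()

    def message_arrives(self, value):
        """Step (1) completing: a network message reaches the node."""
        self.arrived.append(value)
        self._deliver()

    def _deliver(self):
        """Step (3): load the value only when the cv is marked Empty."""
        if not self.full and self.arrived:
            self.value = self.arrived.popleft()
            self.full = True


def exchange(send_first):
    """Run steps 1 and 2 in either order and report the final cv state."""
    cv = CommVar()
    if send_first:
        cv.message_arrives(42)
        cv.receive()
    else:
        cv.receive()
        cv.message_arrives(42)
    return cv.full, cv.value
```

Both `exchange(True)` and `exchange(False)` end with the cv marked Full and holding the sent value.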

1 The instruction counter, while limiting choice, speeds up the execution by fetching 
and decoding instructions prior to the arrival of data. In pure data flow schemes, it is the 
arrival of data which triggers an instruction fetch, thereby adding a delay to data execution. 
Architecturally, the issue arises whether to map these 
cvs to memory locations or to registers. If the cvs are mapped 
to memory, then the cv space is as large as the variable space, 
but access is slow. Alternatively, if cvs are mapped to regis- 
ters, there is only a small number of them available, but they 
are at least an order of magnitude faster. Moreover, if the 
receiver is ready before the sender, neither memory nor net- 
work bandwidth is consumed while the receiver busy waits. 
Given sufficiently fast context switching, work could even be 
done in the interim. 


In Microflow, communications variables are mapped to 
registers. The additional hardware is negligible since, in 
effect, a send looks like a remote load.² It is also worth noting 
that the target of a send is a general purpose register; this 
enables the compiler to choose the allocation between 
computation and communication registers that best suits the 
program. 


The implementation of send ensures that a processor can 
send or receive a message every cycle. Hence, peak 
transmission rate is equal to instruction rate, enabling compu- 
tations to exhibit speedups when executing only 1-3 instruc- 
tions between transmissions. 


The above arguments show that message passing and 
shared memory easily integrate into the same architecture. 
The judicious use of message passing also reduces network 
traffic. For example, a shared memory implementation of 
producer-consumer requires three uni-directional network 
messages (one for the write, and two for the read). Using 
message passing, the same function is performed by one 
network message.³ 


2.1.3. Communication variable queues 


Associated with each communications variable is a 
queue which can contain up to P elements. The queue serves 
two purposes: 


(1) It allows each processor to send to a cv without first 
getting permission. This avoids an extra round trip 
of network latency.⁴ 


(2) It reduces the amount of work each processor per- 
forms by requiring it to look one place for "work". 
Hence, no polling of cvs is required. 


There is insufficient room on the processor chip to store 
the queue. Therefore, both the initial segment of the queue 
and the queue management control are implemented in the 
cache. Once the queues grow beyond a certain size, the tail 
of the queue is stored in main memory. This size is chosen so 
that main memory references are infrequent. 


Another issue that arises is how the processor and cache 
chip communicate about the queue. Every time a processor 
issues a receive or a load, the register and window specified 
are provided to the cache. The cache maintains a duplicate 


2 Since it is illegal for the software to use a register for both a cv and a load target, one 
set of Full/Empty bits is sufficient for both purposes. 


3 The example is simplified since a) multiple reads might be needed if the consumer 
started reading before the producer produced the data and b) there is no handshaking to 
ensure the message is not lost. The cost of handshaking can be made arbitrarily small by 
associating queues of length L and handshaking after every L items. 


4 Large fan-in cvs do not imply a serialization point since the number of sends to a cv 
may be of size P in the worst case, but is usually much smaller. For example, breadth first 
search must assume a graph-node at a processor node is connected to a node at every other 
processor, even if this will rarely happen. 


set of Full/Empty bits. Whenever the cache receives a value 
for a register which is Empty, that value is sent immediately to 
the processor chip; otherwise it is cached. 
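
A rough sketch of this cache-side behaviour follows. The names, the `on_chip_slots` threshold, and the refill policy are assumptions made for illustration; the paper states only that the initial segment of the queue lives in the cache, the tail in main memory, and that the cache keeps a duplicate Full/Empty bit.

```python
from collections import deque


class CacheQueue:
    """Cache-side view of one register's message queue (illustrative)."""

    def __init__(self, on_chip_slots=4):
        self.reg_empty = False   # duplicate of the register's Full/Empty bit
        self.head = deque()      # initial segment, held in the cache
        self.tail = deque()      # overflow, held in main memory
        self.on_chip_slots = on_chip_slots
        self.forwarded = []      # values sent on to the processor chip

    def processor_receive(self):
        """The processor issued a receive: its register is now Empty."""
        self.reg_empty = True
        self._forward()

    def network_arrival(self, value):
        """A send addressed to this register reached the node."""
        if len(self.head) < self.on_chip_slots and not self.tail:
            self.head.append(value)
        else:
            self.tail.append(value)  # keeps arrival (FIFO) order
        self._forward()

    def _forward(self):
        """Pass one queued value to the processor if its register is Empty."""
        if self.reg_empty and self.head:
            self.forwarded.append(self.head.popleft())
            self.reg_empty = False   # the register is Full again
            if self.tail:            # refill the cached segment
                self.head.append(self.tail.popleft())


def demo():
    q = CacheQueue(on_chip_slots=2)
    for v in (1, 2, 3):
        q.network_arrival(v)         # third value spills to the tail
    for _ in range(4):
        q.processor_receive()        # fourth receive finds the queue empty
    q.network_arrival(4)             # so this value is forwarded at once
    return q.forwarded
```

In this model the processor never polls: a value is pushed to the chip as soon as both a value and an Empty register exist.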


2.2. Context switching 

Rapid context switching in Microflow is used for two 
principal reasons: latency avoidance and the implementation 
of servers. The use of fast context switching for latency 
avoidance is well known. However, we believe its 
application to servers is new. 


Off Chip Connection 
Figure 2 — Multiple windows per processor 


The rapidity of context switching is achieved by replica- 
tion of hardware resources. Each processor chip contains W 
sets of registers, called windows and one ALU (see Figure 2). 
Independent instruction streams are executed in Microflow’s 
windows, unlike the overlapping windows used in some 
RISC architectures. Each window contains the entire proces- 
sor state for a context; at most one context (or window) is 
designated as current. The multiplicity of registers means 
that not only can context switching be performed without 
saving or restoring registers but that the pipeline does not 
even need to be flushed. 


Figure 3 — Registers per window (legible labels: Program Counter, Countdown timer, Quanta, Instruction Buffer, Status) 


Each window contains a set of general purpose registers, 
a program counter, a countdown timer, a quanta size, an 
instruction buffer, a status register, and a Communications 
and Addressing Register (CAR). See figure 3. The CAR is a 
multifield register which enables the construction of very 
complex network messages; however, most network mes- 
sages are produced by a single 3-address RISC instruction. 


Figure 4 — State diagram for windows (legible labels: timer = 0; Next Eligible window; Reg becomes Full; Attempted use of empty register; Blocked) 


A window which is currently active remains active until 
either its countdown timer reaches zero (expired) or it 
attempts to use a register which is marked Empty (blocked). 


The context then switches to the next window which is 
neither blocked nor expired and the countdown timer on the 
previous window is reset to the quanta size. This context switch 
and resetting of the timer take place concurrently with 
instruction execution. Hence, no cycles are lost due to 
context switching; and it is possible, for example, to run HEP 
style instruction interleaving [Smi81] by setting the quanta to 
one in all windows. 
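
The window switching rule can be sketched as a cycle-by-cycle simulation. This is a simplified model under stated assumptions: the function name is hypothetical, windows are searched in round-robin order, and the cycles in which a window is blocked on an Empty register are supplied as an input (`blocked_at`) instead of being derived from real loads or receives.

```python
def schedule(quanta, blocked_at, cycles):
    """Return the window that issues an instruction in each cycle.

    quanta[w]     -- quanta size of window w
    blocked_at[w] -- set of cycles in which window w would touch an
                     Empty register (hypothetical input)
    """
    n = len(quanta)
    timers = list(quanta)
    current = 0
    trace = []
    for cycle in range(cycles):
        # Leave the current window if its timer expired or it is blocked.
        if timers[current] == 0 or cycle in blocked_at[current]:
            timers[current] = quanta[current]  # reset to the quanta size
            for step in range(1, n + 1):       # step == n is the window itself
                cand = (current + step) % n
                if cycle not in blocked_at[cand]:
                    current = cand
                    break
            else:
                trace.append(None)             # every window is blocked
                continue
        trace.append(current)
        timers[current] -= 1
    return trace
```

Setting every quanta to one gives strict interleaving, for example `schedule([1, 1], [set(), set()], 4)` yields `[0, 1, 0, 1]`. A window whose neighbours are all blocked simply "switches" to itself and keeps running.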


The processor instruction set is a normal RISC instruction 
set augmented with instructions for constructing complex 
messages (using the CAR) and the instructions: 


send dest_processor, dest_register, value 
receive R_i 


where dest_processor and value are register contents and 
dest_register is a literal. To send a message to a different 
window, the window field of the CAR is set immediately 
before the send. 


The first window contains the application code and is 
expected to consume the lion’s share of the processor cycles. 
Other windows contain server code that is invoked by the 
application and proceeds independently. Examples of servers 
include fetch&add, and-trees, and or-trees. 


The server acts on a demand basis. Since context 
switching is free (if no other window has any work to do, a 
window will "switch" to itself), a small quanta is assigned to 
the application thread, with larger quanta assigned to the 
server thread. The application thread will execute only when 
the server has nothing to do, and the server (which is 
infrequently busy) will respond almost at once. 


Of course, server code could be executed in the application 
window. But message arrival is asynchronous, requiring 
the application window either to perform only message code (and 
be idle waiting for messages) or to continue running the 
application and interrupt on message arrival (implying a large 
overhead). Neither of these alternatives is acceptable. By 
using separate windows, the application can request a service 
before it needs the result, thereby overlapping the application 
with the server and avoiding latency. 

The organization of separate server windows (and the 
low cost of embedding trees in the Microflow network) 
makes servers sufficiently fast to perform combining opera- 
tions in software (for fetch&add operations), and to replace 
combining of reads with broadcast operations. We believe 
that this should speed up switch operation by a factor of two 
over combining switches [DKS85], while more easily ena- 
bling high fan-in/fan-out switches. Not only will the network 
run faster, but the software servers enable the construction of 
trees that perform blocking functions. For example, the 
and-tree requires all processors to supply a vote before any 
result is returned. Such algorithms require busy waiting with 
network combining techniques. 
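
As a rough illustration of a blocking and-tree, the sketch below combines one vote per processor up a tree and then broadcasts the result. It is a sequential simulation under assumptions of its own: the heap-style binary tree (parent of node i is (i - 1) // 2) and the function names are hypothetical and say nothing about how trees are actually embedded in the Microflow network.

```python
def and_tree(votes):
    """Combine one vote per processor up a binary tree, then broadcast.

    No result exists until every processor has contributed its vote,
    which is the blocking behaviour described above.
    """
    p = len(votes)
    partial = list(votes)
    # Upward pass: visiting i from high to low guarantees that a node has
    # heard from its children (2i + 1 and 2i + 2) before it reports upward.
    for i in range(p - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = partial[parent] and partial[i]
    result = partial[0]
    # Downward pass: the root's result is broadcast to every processor.
    return [result] * p


def tree_depth(p):
    """Number of levels below the root in a heap-style tree of p nodes."""
    return p.bit_length() - 1
```

For a thousand processors `tree_depth(1000)` is 9, so one upward and one downward pass cross about eighteen tree edges in this model.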


2.3. Node design 


Each node consists of a processor, memory, and a 
switch as shown in figure 5. 


Figure 5 — A Microflow Node (legible labels: Processor; Cube Connection) 


Each memory module is physically adjacent to some 
processor, although all memory is globally addressable. 
Since the processor node has a cache, the memory can be 
constructed from inexpensive dynamic RAM chips. Since 
the cache is snoopy, the processor can cache shared variables 
whose location is in the local memory. This enables the 
accessing of "hot-spot" variables to occur at cache rates 
rather than at the much slower memory speeds. The switch 
design performs only simple routing most of the time; when 
the network becomes congested the switch starts to perform a 
deadlock avoidance algorithm. This enables the switch to be 
built for the maximum speed possible, and as we shall see, 
this is particularly important on high fan-in/fan-out switches. 


2.4. Network 


The target applications and granularity of Microflow 
require that trees (for the purposes of control) be inexpensively 
embedded in the network; that is, adjacent nodes in the 
tree should be adjacent in the network. We have chosen 
Cube-Connected Cycles (CCC) as our network [PrV81]. 
CCCs maintain many of the properties of Hypercubes, 
including logarithmic diameter and good performance on 
many Hypercube algorithms, but require hardware 
proportionate to the number of processors P. (A Hypercube 
requires hardware proportionate to P log P.) 
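
The hardware comparison can be checked with a short calculation of link counts. The formulas are the standard ones for these topologies (a d-dimensional CCC has d·2^d nodes of degree three; an n-dimensional hypercube has 2^n nodes of degree n); the function names are illustrative.

```python
def hypercube_links(n):
    """Links in an n-dimensional hypercube with P = 2**n nodes."""
    return n * 2 ** (n - 1)


def ccc_nodes(d):
    """Nodes in cube-connected cycles of dimension d."""
    return d * 2 ** d


def ccc_links(d):
    """Links in cube-connected cycles of dimension d (d >= 3).

    Every node has two cycle links and one cube link, and each link is
    shared by two nodes.
    """
    return 3 * ccc_nodes(d) // 2
```

For P = 2048, a CCC with d = 8 needs 3072 links (1.5 per node, independent of P), whereas a hypercube with n = 11 needs 11264 links (n/2 per node, growing with log P).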


Fast switches play an important role in fine-grain parallel 
processing. The switch shown in Figure 5 conceptually 
contains two cycle connections, one cube connection and one 
processor connection; however, k adjacent switches in a cycle 
can be coalesced into one k-ary switch. 


The number of wires in the interconnection network and 
especially switch design will require messages to be 
packetized. However, it is advantageous for a processor to have 
parallel access to its memory. Therefore, the switch design 
merges k packetized processor ports into a single shared bus. 
Studies in uniprocessor systems with snoopy caches indicate 
that busses can support at least eight processors. For 
Microflow this number is likely to be somewhat lower since 
the bus is also being used for network traffic and because 
more bus activity is required by the fine grain of the 
application. 


Figure 6 — k-ary switch with shared bus (legible labels: CycleOut; k-Switch; CycleIn; CubeP) 
We note that in addition to the single cycle bus access, 
by low-order interleaving of the memory modules, a processor (or 
network switch) can perform stores at peak performance. This 
design enables this balanced performance to be achieved with 
a smaller number of total memory modules (over the design 
in Figure 6). 


3. Comparison to other architectures 


The technique of multiple windows for latency 
avoidance dates back at least to the peripheral processor units 
(PPU) on the CDC 6600 [Tho70] and was first (and very 
elegantly) proposed for parallel processors in CHOPP 
[SiB77], in which the logarithmic random access delay was 
masked by log(P) windows. Snoopy caches were first 
proposed by Goodman [Goo83]. 


The use of Full/Empty bits on registers, or more com- 
plex schemes, also dates back at least to the CDC 6600, and 
probably much earlier. The first commercial computer to use 
Full/Empty bits on memory is the Denelcor HEP [Smi81], 
which used the bits as a means of performing fine-grained 
dataflow handshaking. However, a consuming process would 
have to busy wait if it was ready before the corresponding 
producing process. To eliminate busy waiting, I-structures 
[ArT80] provide a queue of waiting reads, which were 
automatically triggered when a write occurred. Microflow’s 
message structure is more general than I-structures (and is 
faster). For example, if a large number of reads are pending, 
I-structures produce a serial bottleneck — in Microflow, a 
broadcast tree can be constructed resulting in no bottlenecks. 


The Denelcor HEP also had Full/Empty bits on registers, 
and processes running on the same processor could 
communicate through sharing of register address spaces. However, 
there was no way for processes running on different 
processors to affect each other's register sets, so this 
mechanism was not very heavily used. 


The designs which Microflow is most closely related 
to in spirit are the Connection Machine [Hil85] and 
Message-Driven Processor (MDP) [DCC87]. The Connec- 
tion Machine achieves synchronization by two means: its 
SIMD instruction execution and a global network empty sig- 
nal which ensures that all messages have been delivered. In 
Microflow, fast point-to-point synchronization is the basic 
mechanism. When global synchronization is needed, trees 
are constructed in software. This mechanism enables a 
heterogeneous set of operations to occur (MIMD), network 
accesses to be pipelined, each processor to execute 
only those instructions relevant to it, and the memory 
hierarchy to be exploited (both faster and cheaper access). 


The MDP communicates only by messages, but simu- 
lates shared memory by having a message unit which is able 
to simulate reads and writes. The MDP messages are sent to 
objects which then must be invoked, resulting in delays to 
fetch instructions and a very limited register set which caused 
accesses to be made to local memory. To make these 
accesses as fast as possible, memory is implemented on chip 
thus restricting its size. The constraint of on-chip memory 
also limits the ability to take advantage of the memory 
hierarchy, which reduces average access time, and its small size 
increases communication and latency requirements. 


4. Performance Parameters 


The Performance Parameters for a Microflow architec- 
ture are described in terms of the basic instruction rate of a 
processor. Using current technology, a Microflow processor 
would be implemented as a RISC augmented with multiple 
independent windows and message passing instructions. Our 
unit of time, the cycle time, is the time to execute one 
instruction. A network switch hop can be conservatively 
implemented in 0.25 cycles. This means that adjacent nodes can 
communicate in about 1 instruction time, for a k=4, 
packets=4 switch. Table 1 shows the Microflow parameters. 


Operation                                  Time (cycles) 
processor instruction                      1 
switch hop                                 0.25 
memory read from furthest node (P = 2K)    (value not legible) 

(The remaining rows of the table are not legible in the source.) 


Table 1 — Microflow architecture speed parameters 


We have coded up a number of algorithms, including 
summing, enumeration, sorting (parallel quicksort), breadth- 
first search, parallel prefix on linked-lists, matrix multiply 
and inversion, and transitive closure. The number of server 
windows was between 0 and 3 (with an average of 1/2). 
Moreover, the number of registers needed was never larger 
than logarithmic in the number of processors. Hence only a 
limited number of communications variables seems to be needed 
in an integrated architecture. 


5. Conclusions 
We have described some of the effects of implementing 
integrated architectures (those containing both shared 
memory and message passing) and discussed their 
performance advantages over either shared memory or message 
passing. The Microflow architecture is a very efficient 
implementation of an integrated architecture. Microflow: 


• Provides extremely fast message passing by having 
targets of messages be general purpose registers. 


• Provides enqueuing of messages destined to a single 
register both to reduce handshaking delays across 
the network and to eliminate polling by the 
receiving processor. 


• Uses windows not only for latency toleration, but 
for latency avoidance by extending hardware 
latency avoidance techniques to software. 


• Is extremely fine grained while maintaining the 
performance advantages on serial code. 
Abstract—The performance of cache-based, shared-memory 
multiprocessors can suffer greatly from moderate cache miss 
rates because of the usually high ratio between memory-access 
and cache-access times. In this paper we propose a cache de- 
sign in which the handling of one or several cache misses occurs 
concurrently with processor activity. In multiprocessors, such 
lockup-free caches aggravate the memory coherence problem. 
The proposed design relies on a cache-block size of one word 
and as a result is simple and efficient. A multiprocessor archi- 
tecture, using lockup-free caches, is described and shown to be 
correct. Through performance models, we identify system con- 
figurations for which lockup-free caches are effective. Compiler 
techniques, to take advantage of the proposed design, are illus- 
trated at the end of the paper. 


1. INTRODUCTION 

Cache memories are commonly used to reduce the memory ac- 
cess latency for both data and instruction accesses. Caches can 
do this very effectively and economically [1]. In shared-memory 
multiprocessors caches are more important than in uniproces- 
sors because the individual processors of a multiprocessor must 
be connected to the shared memory through an interconnec- 
tion. Increased memory access latency and conflicts reduce the 
efficiency of each processor. Prefetching, which can reduce the 
apparent access latency visible to the processors, is also more 
difficult in multiprocessors because of the coherence problem 
[2]. 

It is possible to design caches which do not block the 
processor on an access miss; they are called lockup-free caches. In 
such designs the processor may continue sending requests to 
the cache both for data and instructions while the cache and 
the main memory system are resolving one or several previ- 
ous misses. Such a scheme was described by Kroft in [3] for 
a uniprocessor. When processors are part of a shared-memory 
multiprocessor, the design of lockup-free caches becomes diffi- 
cult because of the added problem of maintaining cache coher- 
ence. 

In this paper we describe a multiprocessor architecture which 
allows caches to be lockup-free and in which synchronization is 
enforced by means of “hardware-guarded” primitives. We show 
that the operation of the caches is, with a few exceptions, very 
similar to that of lockup-free caches of uniprocessors. Further- 
more, a straight-forward, snoopy cache coherence protocol can 
be used to enforce inter-cache consistency. The benefit of this 
architecture is that the overall cache miss penalty (i.e., the av- 
erage time a processor is blocked because of a cache miss) is 
reduced and hence a small block size can be used to minimize 
overall cache-memory traffic. Goodman [4, 5] has argued that a 
block size of one word reduces memory traffic. Reduced memory 
traffic increases the possible system throughput for a given in- 
terconnection. In a different context, Lee et al. [6] have shown 
that in a multiprocessor it is preferable to use a small block 
size and to offset the penalty due to increased misses through 
processor prefetching. In such systems “blind” prefetching of 
instructions and data resulting from a larger block size is re- 
placed by “smart” prefetching performed in the processor and 
assisted by the compiler. 

The performance advantage which can be reaped from a 
lockup-free cache depends on the average shared memory access 
time and on the dependencies within each instruction stream. 
A performance model including these two parameters is devel- 
oped and design trade-offs are discussed. 


2. MULTIPROCESSOR CACHE OBSTACLES 
2.1 Multiprocessor Cache Performance 
Even small cache miss rates can have detrimental effects on the 
throughput of high-speed processors. As the following evalua- 
tion demonstrates, the efficiency of the processor can go down 
rapidly when the memory access time is large or when the hit 
ratio is low. 

Let us assume a processor system with the following charac- 
teristics: 


$X$ is the maximum throughput of the processor in MIPS 
(Millions of Instructions Per Second) if all accesses can be 
resolved by the cache (i.e., a cache hit rate of 1.0). $X$ can 
often be easily estimated for a given processor architecture 
and instruction mix. 


$d$ is the average number of accesses per instruction execution, 
including instruction fetches, operand fetches and resultant 
stores. It is called the demand rate. 


$T_m$ is the average time to resolve a miss via main memory. 

$h$ is the average cache hit rate. 


$t_{i0}$ is the average time it takes to execute one instruction if all 
accesses are cache hits (i.e., $t_{i0} = 1/X$). 


If the processor blocks on every miss, the average time to 
execute one instruction, $t_i$, is given by $t_i = t_{i0} + (1-h)\,d\,T_m$. Hence, 
the average performance of the system, in MIPS, is reduced to 
$X' = \frac{1}{t_{i0} + (1-h)\,d\,T_m}$. Dividing by $t_{i0}$ and letting $T^0 = T_m / t_{i0}$ 
yields $X' = X \cdot \frac{1}{1 + (1-h)\,d\,T^0} = X F_s$. $F_s$ is called the slowdown 
factor. It varies between 0 and 1 and the closer it is to 1, the better 
the processor efficiency. In Figure 1, the performance 
degradation of a cache-based system is shown for different values of 
$h$ and $T^0$. As can be seen in Figure 1, even a system with 
a relatively high hit rate of 0.98 can suffer substantially, if the 
average main memory access time is high relative to the cache 
access time. Particularly in the case of fast processors, the ratio 
$T^0$ is likely to be in the upper ranges shown in Figure 1. 
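
The slowdown factor is easy to evaluate numerically. The sketch below computes F_s = 1 / (1 + (1 - h) d T0), where T0 is the ratio of the miss time T_m to the all-hit instruction time; the function names and the sample values of d and T0 used afterwards are assumptions chosen for illustration, not figures from the paper.

```python
def slowdown_factor(h, d, t_ratio):
    """Slowdown factor F_s for a processor that blocks on every miss.

    h       -- average cache hit rate (0..1)
    d       -- demand rate: memory accesses per instruction
    t_ratio -- T0 = T_m / t_i0, miss time relative to the all-hit
               instruction time
    """
    return 1.0 / (1.0 + (1.0 - h) * d * t_ratio)


def effective_mips(x, h, d, t_ratio):
    """Delivered throughput X' = X * F_s, for peak throughput x in MIPS."""
    return x * slowdown_factor(h, d, t_ratio)
```

With a hit rate of 0.98, a hypothetical demand rate of 2 accesses per instruction, and a hypothetical T0 of 25, `slowdown_factor(0.98, 2, 25)` is about 0.5: the processor delivers only half of its peak throughput despite a 98% hit rate.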


In uniprocessors hit rates of 1 could not be obtained even 
if the cache size was infinite, because of the initial loading of 
instructions and data. After a context switch the cache expe- 
riences a cold start period when most of the blocks of the new 
process have to be reloaded [7, 8]. In multiprocessor caches, 
data must also be invalidated because of the modification of 
cached data by other processors. These invalidations reduce 
the cache hit rate as compared to the uniprocessor case [9, 
10]. Finally, it is well known [11], that a high average hit ratio 
hides wide variations in the hit ratio of individual programs. A 
truly general-purpose system should exhibit more uniform per- 
formance for different workloads. 


2.2 Multiprocessors vs. Uniprocessors 

The easiest way to build cache-based multiprocessors is to in- 
terconnect them with one or several buses. Since the buses 
provide a broadcast medium which automatically serializes all 
accesses, maintaining cache coherence is simplified. The prob- 
lem with using buses lies with their limited bandwidth. Even if 
processors have large private caches, a great deal of memory to 
cache communication is still necessary due to the need to prop- 
agate data updates (usually in the form of an (1) invalidation, 
(2) fetch block sequence). Frequent updates and consequent 
invalidations have two effects—they strain the bandwidth ca- 
pabilities of the bus and they lower the invalidated caches’ hit 
rates. While small block sizes can reduce bus traffic, there is 
the detrimental effect of lowering overall hit rates since caches 
cannot benefit from the spatial locality of code and data. The 
effect is particularly bad during cold start periods—after a con- 
text switch, for example—when the cache misses on a large 
number of consecutive accesses. 


The benefits of small block sizes can be reaped if processors 
are not constrained to wait for individual misses to be resolved 
before initiating another access. The use of lockup-free caches 
in multiprocessors is restricted, though, by the fact that logical 
problems can arise when accesses are performed out of program 
order. This may be the case in a lockup-free cache if, for ex- 
ample, a miss is followed by a hit, with the result that the 
access which hits is performed before the access which misses 
because of the longer time to resolve the miss. The restriction 
on the order in which accesses must be performed is due to the 
possibility of inter-process dependencies [12]. 


If a system makes no attempt to enforce all inter-process 
dependencies, and the programmer and compiler are aware of this 
fact, then the out-of-program order execution of accesses by a 
single process is allowable. (Of course intra-process dependen- 
cies must still be preserved.) In such a system it is possible 
to design caches to be lockup-free. However, after having re- 
moved the possibility of using shared variables to implement 
synchronization, there must be an alternate method available 
to allow processes to synchronize. Special hardware-recognized 
primitives, such as the test&set instruction, can be used to 
implement synchronization. 


2.3 Restrictions On Ordering 
With respect to the ordering of events within a multiprocessor, 
the user (or compiler) may expect the system to adhere to 
one of two logical models of behavior. In [12] these models are 
called the strongly ordered and the weakly ordered model of be- 
havior. In a strongly ordered system, processors must initiate 
memory accesses one-by-one, in program order [13]. Further- 
more, all processors must “observe” all other processors’ write 
operations in the same order. (By “observe” it is meant that an 
update becomes readable.) A system that is strongly ordered is 
sequentially consistent and can implement synchronization by 
software alone. 

A weakly ordered system assumes three types of shared data. 


1. Instructions¹, private data, and non-writable data can be 
accessed and cached by all processors in any possible order. 
Since non-writable data are never modified, no inter-process 
dependencies can exist on such data. This is also 
true for private data which are only modified and read by 
one processor. 


2. All other ordinary shared writable data can only be mod- 
ified in mutual exclusion. These are data used to transfer 
information from one process to another. 


3. Synchronization variables are data used to enforce mutual 
exclusion on write accesses to ordinary shared writable 
data. Synchronization variables are hardware recognizable 
as such. 


Data of type (1) pose no difficulty and will not be further dis- 
cussed. Data of type (2) are user/compiler generated. Accesses 
to such data must be protected by critical sections or semi- 
critical sections [10]. In the first case data may only be read or 
written by one processor at a time—the processor which has 
gained access to the appropriate critical section. In the second 
case, several processes are allowed to read the same data at the 
same time but updates must occur in a critical section. 

Accesses to data of type (3) are synchronization primitives 
such as test&set operations. Since data modified within a 
critical section cannot be read by another processor, while the 
modifying processor is still executing the critical section, the 
order in which the data are modified within the critical section 
is immaterial. The only constraint for correctness is that all up- 
dates have properly propagated when the critical section is ex- 
ited. However, a processor is not “aware” whether it is presently 
executing a critical section or not. Since critical sections may 
be nested or overlapped, keeping track of critical sections by the 
processor is not simple. For correctness, though, it is sufficient 
that all accesses of type (2) have propagated and completed, 
before an access of type (3) can complete. If a synchronization 
variable access is encountered, either a critical section is entered 
into or one is exited from—in either case all previous accesses 
must have been performed. 

We summarize this section by defining four properties that 
must be maintained for a weakly ordered system to remain cor- 
rect: 


P1: Intra-process dependencies must be observed at all times.
Such dependencies must be treated as in any conventional
uniprocessor.

¹ We assume separation of instructions and data so that
instructions can never be modified.


P2: Memory coherence must be maintained at all times. This 
is true for all three data types. Any processor’s read oper- 
ation will always reflect the most recent write operation to 
the datum. (Memory coherence of type (2) data only has 
to be restored before the processor that modifies it exits 
the critical section. However, since it is easy, we assume 
that memory coherence is maintained at all times.) 


P3: All modifications of type (2) data must be performed from
within critical sections. The compiler or the programmer
must ensure that no two processes can be in the
same critical section at the same time.


P4: All pending misses must be resolved before an access to
type (3) data can proceed, within a processor.


3. LOCKUP-FREE CACHES IN WEAKLY ORDERED SYSTEMS

We limit our discussion here to weakly ordered bus-based systems
which adhere to property P2, described in the preceding
section (i.e., memory coherence is maintained at all times for
all data).


3.1 Basic Operation 

Most of the time the cache responds to processor requests of
type (1) and type (2) data. With respect to such data, the
operation of the cache is equivalent to the operation of a
uniprocessor lockup-free cache. Kroft described the implementation
of such a cache in [3]. A basic overview of Kroft’s principle
of operation is given here; for more details the original paper
should be consulted.

Multiple misses are resolved by storing information about 
each and then forwarding the miss request, packaged along with 
some vital return information, to the main memory. This is 
accomplished with the following in mind. 


• Local dependencies must be observed.


• If a missed and to-be-returned block is to be allocated,
space must be reserved in the cache for that block and, if
necessary, a replacement must be made.


• Miss requests must be tagged such that:


1. The word of the block which caused the miss is known. 


2. The functional unit which the word is to be forwarded 
to is known. 


3. The slot in cache which is reserved for the block is 
known. 


Most of the above requirements are implemented by a set of
associatively accessible registers, called MSHR registers (Miss
Information/Status Holding Registers), which keep track of the
status of all pending misses. Returned blocks are buffered in a
stack which can either be emptied, as contention allows, or can
be accessed directly to speed up immediate demands.
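
The bookkeeping described above can be sketched in Python. This is an illustrative model only, not Kroft's design: the names `MSHR`, `MSHRFile`, `word_offset`, `dest_unit` and `frame` are assumptions chosen to mirror the three tagging requirements, and the associative lookup is modelled as a linear search.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MSHR:
    """One miss information/status holding register (hypothetical fields)."""
    block_addr: int    # address of the block in transit
    word_offset: int   # which word of the block caused the miss
    dest_unit: int     # functional unit / register awaiting the word
    frame: int         # cache frame reserved for the returning block


class MSHRFile:
    """A small set of MSHRs searched by block address."""

    def __init__(self, size: int = 4):
        self.size = size
        self.entries: List[MSHR] = []

    def lookup(self, block_addr: int) -> Optional[MSHR]:
        # Hardware would compare all entries in parallel.
        for entry in self.entries:
            if entry.block_addr == block_addr:
                return entry
        return None

    def allocate(self, entry: MSHR) -> bool:
        """Record a pending miss; False means the cache must lock."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append(entry)
        return True

    def retire(self, block_addr: int) -> List[MSHR]:
        """Remove and return every entry waiting on a returned block."""
        done = [e for e in self.entries if e.block_addr == block_addr]
        self.entries = [e for e in self.entries if e.block_addr != block_addr]
        return done

    def empty(self) -> bool:
        return not self.entries
```

The `empty()` query is the condition a synchronization access would wait on in a weakly ordered system.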


3.2 Multiprocessor Issues 

Multiprocessor caches must react in a specific way upon type
(3) data accesses. An access to type (3) data means that a
synchronization point has been encountered. To adhere to P4,
a type (3) access must be disallowed until all pending accesses
have been resolved. In Kroft’s type of architecture, this implies
that all MSHR registers must be empty before the synchronization
access can proceed. Once the access has been performed,
the cache continues to operate as usual, resolving misses and
hits concurrently, until another synchronization is encountered.


Type (3) data may be cached and are subject to coherence 
control as are all other data. The only constraint on type (3) 
data is that they are recognizable by the hardware. In modern 
processors this is done by accessing these data through special 
synchronization primitives, such as test&set operations. The 
processor can inform the cache at the time of the access that 
type (3) data are the target. A test&set instruction can be 
executed indivisibly at the cache if an ownership scheme [14] is 
used to implement cache coherence. 


4. A SAMPLE ARCHITECTURE 

Figure 2 depicts a sample architecture. The multiprocessor is 
bus-based, with N processors connected to M memory modules 
by one or several packet-switched buses. The programmer and 
the compiler assume that the system behaves in a weakly or- 
dered manner. The block size of the cache is one word (32 bits). 


4.1 Processors 

The specifics of the individual processor architectures are not 
important. For the system to be useful, processors should be 
able to prefetch operands and instructions as far ahead as pos- 
sible. Since the block size is one word, write operations can 
always proceed, whether or not the block is in the cache. That 
is, since a write redefines the value of an entire block, the 
block need not be fetched before modification. (How coherence 
is maintained on such write operations is discussed in the next 
Section.) However, this means write operations to bytes, or 
half-words must be compiled to preserve the correct outcome 
of the intended operation. We believe that this drawback is far 
outweighed by the fact that write operations always hit at the 
cache. 


4.2 Caches and Cache Coherence 

The complexity of the cache architecture depends on both the 
number of MSHR registers implemented and on the cache co- 
herence protocol used. The coherence protocol we suggest is a 
bus-based, “snoopy cache” protocol. This protocol works either 
for single or multiple packet-switched buses. In the absence of 
better evidence, we assume that the number of MSHR registers 
is four. 


The cache block size is one word. Each block can be in one 
of three states. Namely, RO (Read Only) to indicate that the 
block may be shared with other caches and may not be modified 
without broadcasting invalidations; RW (Read Write) indicates 
that the block is private and may be modified without delay; 
I (Invalid) indicates that the block is not valid (not in cache) 
and must be requested over the bus if it is to be read. 


The coherence protocol is very simple. A read hit may proceed
at any time if the block is in state RO or in state RW. If
a read miss occurs on a block in state I, a request for the block
is placed on the bus (assuming that an MSHR register is available;
otherwise the cache locks and the request will be posted
after the first MSHR register becomes available). A write
operation always appears to hit. If the block to be written is
indeed in state RW it may be modified without further action.
If the block is in state RO it is modified, and an invalidation
for the block is placed on the bus. The state of the block is
changed from RO to RW. If the block is either initially invalid
or not present it is immediately allocated and an invalidation
is broadcast. The modification of a block without having
previously been fetched is allowed, since a single write operation
always modifies the entire block. Note that it is not possible for
two processors to update the same block concurrently because
mutual exclusion prevents this from happening.

Only in three cases (read miss, broadcast invalidations, and 
write back) is the bus accessed. In the first case, the read re- 
quest may either proceed to the memory for service, or the read 
request is interrupted by the cache which contains the requested 
block in state RW. The previous “owner” cache forwards the 
block to the requesting cache, and changes the block’s state 
from RW to RO. During the transfer of the block, memory is 
updated as well. The second type of bus activity (an invalida- 
tion) causes all caches that contain the block (either in states 
RO or RW) to invalidate their copies. Write-backs are only 
necessary if a to-be-replaced block is in state RW. Blocks in 
state RO can be overwritten. 
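
The three-state protocol described above can be summarized as a small transition table. The sketch below is one reading of the text, not a specification: the function names and the bus action strings are hypothetical, and it assumes that a block fetched on a read miss is installed in state RO, which the text implies but does not state explicitly.

```python
from enum import Enum


class State(Enum):
    I = "invalid"
    RO = "read-only"
    RW = "read-write"


def on_processor_read(state: State):
    """Return (next_state, bus_action) for a local read."""
    if state in (State.RO, State.RW):
        return state, None                # read hit, no bus traffic
    return State.RO, "read_request"       # miss (assumed to arrive as RO)


def on_processor_write(state: State):
    """Return (next_state, bus_action) for a local one-word write."""
    if state is State.RW:
        return State.RW, None             # private block, modify at once
    return State.RW, "invalidate"         # from RO or I: broadcast first


def on_bus_event(state: State, event: str):
    """Return (next_state, response) for an event snooped on the bus."""
    if event == "invalidate":
        return State.I, None              # drop any local copy
    if event == "read_request" and state is State.RW:
        return State.RO, "supply_block"   # owner forwards; memory updated
    return state, None
```

Write-backs on replacement of an RW block are outside this table, since they depend on the frame being reused rather than on a processor or bus event.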

The cache performs four basic tasks: 


1. It responds to and services access requests from the pro- 
cessor. 


2. It monitors the buses for invalidations and accesses it must
respond to.


3. It receives returned blocks on which it previously missed, 
either from the memory or from another cache. 


4. It replaces cache blocks as necessary. 


4.2.1 Task 1: Processor Requests 

If a read request of type (1) or type (2) data hits at the cache, 
the word is immediately supplied to the processor. A read miss 
results in the following activity: 


• Check all MSHR registers to see whether the requested block is
already in transit (in this case the processor is accessing the
same word twice in a row). If so, the request is allocated
to another MSHR register along with the target register of
the processor. If no MSHR register is available, the cache
locks. When the block is returned, it is allocated in the
cache, the MSHR register is deallocated, and the word is
forwarded to the processor.


• If there is no MSHR match, then the block is allocated
in a reserved cache frame (a replacement is triggered if
necessary) and an MSHR register stores the necessary return
information. A reserved cache frame is marked as reserved.


• If no MSHR register is available, the processor locks until
one becomes available.


For any write request, the block is written, whether or not it 
is present. If the block was not present or was in state RO, 
an invalidation with the address of the block is broadcast. No 
MSHR register is allocated to the access in this case. If the 
block was not present it may cause a replacement and a write 
back. The block will be in state RW. It is not possible for a 
block causing a write miss to be already present in an MSHR 
register, since this would imply an intra-processor dependency. 

If the request is to type (3) data, the cache locks until all
MSHR registers are empty (all pending misses have been resolved).
Then the access to type (3) data is resolved like any
other access. If it causes a miss, the cache remains locked until
the access is completely resolved (i.e., the data are returned).
Then the cache is unlocked and may proceed to resolve misses
to type (1,2) data concurrently.
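
The decision logic of Task 1 can be condensed into a single function. This is a simplified sketch under stated assumptions: `handle_request`, its parameters, and the returned action strings are hypothetical names, and the check for a block already in transit is omitted for brevity.

```python
def handle_request(kind: str, data_type: int, hit: bool,
                   mshr_busy: int, mshr_total: int = 4) -> str:
    """Return the cache's action for one processor request.

    kind       -- "read" or "write"
    data_type  -- 1, 2 or 3, as classified in Section 2.3
    hit        -- whether the block is present in the cache
    mshr_busy  -- number of MSHR registers currently in use
    """
    # A synchronization access waits until every pending miss is resolved.
    if data_type == 3 and mshr_busy > 0:
        return "lock_until_mshrs_empty"
    # One-word blocks: a write never needs the old contents or an MSHR.
    if kind == "write":
        return "write_block"
    if hit:
        return "supply_word"
    # A miss on type (3) data is not overlapped with anything else.
    if data_type == 3:
        return "lock_until_data_returns"
    if mshr_busy >= mshr_total:
        return "lock_until_mshr_free"
    return "allocate_mshr_and_request"
```

For example, `handle_request("read", 2, False, 1)` reports that an ordinary read miss can be overlapped, while the same miss with all four registers busy forces a lock.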


4.2.2 Task 2: Bus Watching 

The buses are monitored continuously. If an invalidation of a
present block is detected, the block is invalidated. It is not
possible for a bus invalidation to occur for a block presently being
fetched (i.e., a block referenced in one of the MSHR registers)
since this would violate property P3 of Section 2. If a read
request for a block present in state RW is detected then the
appropriate bus is interrupted and the block is placed on the
bus. The block is forwarded to both the requesting cache and
to main memory. The block remains allocated but the state is
changed to RO.


4.2.3 Task 3: Returned Blocks 

When a miss is resolved, either by the memory or by another
cache, the MSHR registers are checked to find out where the
block is allocated and which processor register it is intended for.
The block is placed in the reserved cache frame and forwarded
to the processor. One more contingency must be taken care
of. The cache frame which was reserved for the miss may have
been “replaced”². This is handled by marking the appropriate
MSHR register if a reserved block is replaced. In this case the
word is only forwarded to the processor but does not get
allocated at the cache.


4.2.4 Task 4: Block Replacements 

A block is replaced when the cache frame it resides in is needed.
Two flags associated with the cache frame need to be checked.
If the block is in state RW the block needs to be written back
to main memory. Otherwise it may be overwritten. A block
marked as reserved means that the word of a pending miss is
intended to reside in the frame. In this case, the cache frame
may be used but only after checking the MSHR registers and
flagging the appropriate MSHR register. The flag indicates that
the returned block has lost its reserved cache frame and is only
to be forwarded to the processor and not written to cache.
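
The interaction between Tasks 3 and 4 can be illustrated with a short sketch. The names `PendingMiss`, `frame_lost`, `replace_frame` and `on_block_return` are hypothetical, and the cache is modelled as a plain dictionary from frame number to contents.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PendingMiss:
    block_addr: int
    frame: int                # cache frame reserved for the block
    frame_lost: bool = False  # set when the reserved frame is reused


def replace_frame(frame: int, pending: List[PendingMiss]) -> None:
    """Task 4: flag any pending miss whose reserved frame is taken over."""
    for miss in pending:
        if miss.frame == frame:
            miss.frame_lost = True


def on_block_return(miss: PendingMiss,
                    cache: Dict[int, Tuple[int, int]],
                    word: int) -> int:
    """Task 3: always forward the word; allocate only if the frame is kept."""
    if not miss.frame_lost:
        cache[miss.frame] = (miss.block_addr, word)
    return word  # value forwarded to the processor
```

The point of the flag is that the processor still receives its data even though the block no longer has a home in the cache.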


4.3 Memory System 

The memory is interleaved into M modules. FIFO (First In 
First Out) memory buffers queue requests for each module. The 
coherence of the system is not affected by the buffers [12]. 

The architecture allows for virtual memory addressing. In 
this case, each processor has a TLB (Translation Lookaside 
Buffer) which caches the most recently performed virtual-to- 
physical address translations. The cache, however, must lock if 
a TLB miss occurs. The TLB miss may result in a page fault. 
If a page fault occurs, prefetching and writing memory words 
beyond the access causing the page fault must be prevented. 


5. ANALYSIS 

Two models are presented here. Both models are approximate, 
but useful information can be derived from them. We make the 
following assumptions in our model: 


1. A processor makes d memory references per instruction.

² It cannot be invalidated, as mentioned before, but in the case of a
direct mapped cache may have been overwritten by either another read
miss or a write which mapped to the same cache frame.


2. The distance between references with dependencies in the
reference string of a process is fixed and is equal to $l$.
Hence, after a miss, up to $l$ references can be made before
the processor blocks and has to wait for the miss to be
resolved.


3. For each access the probability of a hit is $h$ and the
probability of a miss is $(1-h)$. Successive accesses are
independent. Therefore, the number of references between two
consecutive misses is geometrically distributed with mean
$1/(1-h)$. (Figure 3 illustrates the concept.)


4. The memory access time is constant and equal to $T_m$.


5. The time to execute an instruction if all accesses hit in the
cache is constant and equal to $t_{i0}$. We associate a time of
$t_{i0}/d$ with each reference.


6. Effects of synchronization and of TLB misses are neglected. 


There are several approximations in the model. First of all,
in a practical system the dependency distance is usually variable
and the memory access time is random because of memory
conflicts. These two approximations were made here to facilitate
the solution of the model. Second, successive accesses to
the cache are correlated; however, the hypothesis of independent
accesses is often made in cache models (see for example
[15]) and is as good as any other hypothesis in the absence of
real program traces. Overall, we feel that the models include
the most important parameters affecting the performance of
the lockup-free caches and should give indications as to system
configurations for which the complexity of lockup-free caches is
warranted.


5.1 Model 1: One MSHR Register 
It is assumed that only one MSHR register exists in the cache.
A single miss does not cause the cache to lock immediately. The
cache will block either if a second miss occurs, or if a dependency
with the reference of the first miss prohibits any further
prefetching. We have to consider two different cases. In the
first case, the memory access time is larger than the time during
which the processor can continue prefetching without encountering
a dependency with a previous miss. That is, $T_m > l\,t_{i0}/d$.
In the second case, we assume the opposite, that is, $T_m < l\,t_{i0}/d$.
Case 1: When a miss is encountered, it is immediately forwarded
to the memory. The number of references which can be
overlapped while the miss is being resolved is $\alpha$. This number is
governed by the probability that a miss occurs during the next
$l$ accesses before the processor blocks due to a dependency with
an access that missed (Figure 3). Hence $\alpha$ is given by:

$$\alpha = (1-h) + 2h(1-h) + 3h^2(1-h) + \cdots + l\,h^{l-1}(1-h) + l\,h^l$$

which can be reduced to:

$$\alpha = \frac{1-h^l}{1-h}$$

If $T_N$ is the time to execute $N$ instructions then

$$T_N = N\left(t_{i0} + (1-h)\,d\,T_m\right) - N\,\frac{1-h^l}{1-h}\,(1-h)\,t_{i0}$$

The time to execute one instruction is:

$$t_{i0}\,h^l + (1-h)\,d\,T_m$$

Hence the slowdown factor is:

$$\frac{1}{h^l + (1-h)\,d\,T_m^0}$$

where $T_m^0 = T_m/t_{i0}$ is the memory access time expressed in
units of $t_{i0}$.
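
As a rough illustration, the Case 1 slowdown factor can be evaluated directly. The sketch assumes the normalized memory access time $T_m^0 = T_m/t_{i0}$ and treats a locking cache as the special case $l = 0$, in which nothing is overlapped; both the function name and that interpretation are assumptions made here.

```python
def slowdown_one_mshr_case1(h: float, l: int, d: float, tm0: float) -> float:
    """Slowdown 1 / (h^l + (1-h) d Tm0), intended for Tm0 > l/d."""
    return 1.0 / (h ** l + (1 - h) * d * tm0)


if __name__ == "__main__":
    h, d, tm0 = 0.9, 1.5, 20.0
    locking = slowdown_one_mshr_case1(h, 0, d, tm0)   # l = 0: no overlap
    one_mshr = slowdown_one_mshr_case1(h, 4, d, tm0)
    print("locking cache :", locking)
    print("one MSHR, l=4 :", one_mshr)
    print("MIPS ratio    :", one_mshr / locking)
```

Because the miss term $(1-h)\,d\,T_m^0$ is unchanged and only the hit term shrinks from 1 to $h^l$, the gain from a single register is modest whenever the miss term dominates.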


Case 2: In this case the amount of overlap of hits is only
limited by the probability that a second miss occurs before the
first miss is resolved. This case may result if the code is
restructured by the compiler such that the distance between
dependencies which cause the processor to block is very large. Let
$N$ be the number of references between two successive misses.
$\bar{N} = 1/(1-h)$ and the average time to execute between two
misses becomes

$$\sum_{k=1}^{dT_m^0} p_k\,T_m + \sum_{k=dT_m^0+1}^{\infty} p_k\,\frac{k\,t_{i0}}{d} = T_m + \frac{t_{i0}}{d}\,\frac{h^{dT_m^0}}{1-h}$$

where $p_k = h^{k-1}(1-h)$ is the probability that $N = k$.
This result comes from the fact that if the inter-miss distance
$N$ is less than $dT_m^0$, the cache blocks the processor for a time
$T_m - N\,t_{i0}/d$. The average time per instruction is given by:

$$\frac{d\,T_m + t_{i0}\,\dfrac{h^{dT_m^0}}{1-h}}{\dfrac{1}{1-h}}$$

The slowdown factor for this case is:

$$\frac{1}{h^{dT_m^0} + (1-h)\,d\,T_m^0}$$
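
The Case 2 average can be cross-checked by summing the expectation directly. The sketch below assumes that the time between two misses is the larger of the processing time $N\,t_{i0}/d$ and the memory access time $T_m$, with $N$ geometrically distributed, and that $d\,T_m/t_{i0}$ is an integer; the function names are illustrative.

```python
def mean_time_between_misses(h: float, d: float, tm: float,
                             ti0: float, terms: int = 5000) -> float:
    """E[max(N * ti0 / d, tm)] for geometric N, summed term by term."""
    total = 0.0
    for k in range(1, terms + 1):
        p_k = h ** (k - 1) * (1 - h)       # probability that N = k
        total += p_k * max(k * ti0 / d, tm)
    return total


def closed_form(h: float, d: float, tm: float, ti0: float) -> float:
    """Tm + (ti0 / d) * h^(d Tm0) / (1 - h), with Tm0 = tm / ti0."""
    k_block = d * tm / ti0                 # assumed to be an integer
    return tm + (ti0 / d) * h ** k_block / (1 - h)


if __name__ == "__main__":
    args = (0.9, 2.0, 10.0, 1.0)
    print(mean_time_between_misses(*args), closed_form(*args))
```

Multiplying the closed form by $d(1-h)$, the reciprocal of the mean number of instructions between misses, gives the time per instruction $t_{i0}\,h^{dT_m^0} + (1-h)\,d\,T_m$ that underlies the slowdown factor.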
5.2 Model 2: Infinite Number of MSHR Registers 
In this model, it is assumed that the number of MSHR registers 
is infinite. As for the previous model, we consider two cases. 

Case 1: In this case we assume $t_{i0}\,l/d < T_m$. After the first
miss, at reference $1/(1-h)$, $l$ references can be generated by the
processor while the first miss is resolved by the memory system
in time $T_m$. Therefore, a processing time of $t_{i0}\,l/d$ can be
overlapped with the first miss. If any one of these $l$ references
causes a miss, it can be serviced immediately because there are an
infinite number of buffers. The misses occurring for the $l$ references
do not cause additional blocking due to dependencies because
the processor has to wait until the first miss is resolved before
initiating new references. When the first miss is completed, the
processor can continue execution, and because of the geometric
distribution assumption, the next miss occurs $1/(1-h)$ references
later, on the average. The sequence of events after the
first miss is repeated.

Therefore, in a time

$$\frac{t_{i0}}{d}\,\frac{1}{1-h} + T_m$$

a number of $\frac{1}{1-h} + l$ references are performed, corresponding to

$$\left(\frac{1}{1-h} + l\right)\frac{1}{d}$$

instructions.
The average time per instruction is

$$\frac{t_{i0} + T_m\,d\,(1-h)}{1 + l(1-h)}$$

and the slowdown factor is

$$\frac{1 + l(1-h)}{1 + T_m^0\,d\,(1-h)}$$

Case 2: In this case we assume that $t_{i0}\,l/d > T_m$. When a
reference has a dependency with a previous access that missed,
the miss has had the time to complete and therefore, the
dependencies do not block the processor. We have achieved total
overlap of miss handling and the slowdown factor reaches its
maximum value of 1.
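
The two cases of Model 2 can be combined into one function. This is a sketch in terms of the normalized memory access time $T_m^0 = T_m/t_{i0}$, so that the case boundary $t_{i0}\,l/d = T_m$ becomes $l/d = T_m^0$; the function name is an assumption.

```python
def slowdown_infinite_mshr(h: float, l: float, d: float, tm0: float) -> float:
    """Slowdown factor with an unlimited number of MSHR registers."""
    if l / d >= tm0:
        return 1.0                         # case 2: misses fully overlapped
    # case 1: a dependency blocks the processor before the miss returns
    return (1 + l * (1 - h)) / (1 + tm0 * d * (1 - h))


if __name__ == "__main__":
    for l in (0, 10, 20, 30, 40):
        print(l, slowdown_infinite_mshr(0.8, l, 1.5, 20.0))
```

At the boundary $l/d = T_m^0$ the Case 1 expression itself evaluates to 1, so the two cases join continuously.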


6. DISCUSSION 

6.1 Performance Interpretations 

Figure 4 shows the ratios of MIPS rates of a system with one
MSHR buffer per cache and of a system with locking caches.
The improvement is greatest for low hit rates and memory access
times in the range of $T_m^0 < l/d$. This is due to the fact
that if $T_m^0 > l/d$, most misses will cause some blocking, because
before a miss can be resolved a dependency will occur with the
reference that misses. If $T_m^0 < l/d$, however, the cache will only
lock if a second miss occurs during the service of the first miss;
some misses will be totally overlapped with other references and
the amount of overlap also increases with $T_m^0$.

For larger hit rates, such as h=0.98, the improvement due to
the lockup-free cache with only one MSHR is low because misses
are rare and the only savings per miss, with respect to a locking
cache, are $l$ references. These results confirm Kroft’s results
for the average access time as a function of the number
of MSHR registers. Kroft’s results (derived from a prototype)
indicate very poor performance for a cache with a single MSHR
register and very good performance for a cache with up to four
MSHR registers. Kroft further states that very little is to be
gained by using more than four registers.

Figure 5 shows the ratio of MIPS rates of a system with an
infinite number of MSHR buffers and a system with locking caches
for different hit rates, as a function of $T_m^0$. The improvement of
performance is highest for a low hit rate of h=0.8. This is to
be expected, since a high miss ratio results in multiple misses
in a streak of $l$ consecutive references and these misses can be
overlapped. In the ideal case, for example, $l$ misses occur
sequentially, until the system blocks due to a dependency with
the first miss (we assume $l/d < T_m^0$). The misses are resolved
one after another and the total penalty paid for the sequence of
$l$ misses is only $T_m^0$. Hence, the system benefits from frequent
misses.

For $l/d > T_m^0$, the system with an infinite number of buffers
operates at maximum speed, with a slowdown factor of 1. In
this case no miss ever causes a penalty, since the system does
not block on multiple misses and all misses are always resolved
before a dependency can occur. For higher hit rates the probability
of overlapping misses decreases and the performance ratio
degenerates to the case of the cache with only one MSHR buffer.


The number of buffers used, on average, is an interesting parameter.
It is possible to estimate the average number of busy
buffers in the cache as follows. Let $t_i$ be the average time per
instruction. The average time to execute $N$ instructions is $N t_i$
and a total of $N d(1-h)$ misses must be processed, requiring a
total service time of $N d(1-h)\,T_m$ from the miss buffers. Therefore
the average number of busy buffers is

$$\frac{d(1-h)\,T_m}{t_i} = \frac{d(1-h)\,T_m^0\,[1 + l(1-h)]}{1 + d(1-h)\,T_m^0} \quad \text{(case 1)}$$

$$\text{or} \quad = d(1-h)\,T_m^0 \quad \text{(case 2)}$$

This value is upper-bounded by $1 + l(1-h)$. For all of the
examples of Figure 5, the average number of busy buffers is less
than 2.75.
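
The buffer occupancy estimate and its bound can be evaluated with a few lines of code. The sketch uses the normalized time $T_m^0 = T_m/t_{i0}$; the function name and the sample parameters are assumptions and are not taken from Figure 5.

```python
def busy_buffers(h: float, l: float, d: float, tm0: float) -> float:
    """Average number of busy MSHR buffers in the infinite-buffer model."""
    x = d * (1 - h) * tm0                  # miss service demand per instruction
    if l / d < tm0:                        # case 1: processor sometimes blocks
        return x * (1 + l * (1 - h)) / (1 + x)
    return x                               # case 2: processor never blocks


if __name__ == "__main__":
    h, l, d, tm0 = 0.8, 10, 1.5, 20.0
    print("busy buffers:", busy_buffers(h, l, d, tm0))
    print("upper bound :", 1 + l * (1 - h))
```

In Case 1 the factor $x/(1+x)$ is below 1, and in Case 2 the condition $T_m^0 \le l/d$ gives $x \le l(1-h)$, so the bound $1 + l(1-h)$ holds in both cases.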


6.2 Consequences 
For an otherwise efficient lockup-free cache to be of consequential
benefit, multiple MSHR registers must be implemented.
Only in the case when the hit rate is low and $T_m^0 < l/d$ does a
single MSHR buffer offer worthwhile improvement.

A cache with a number of MSHR buffers can, however, be
very useful. Such a cache can offer substantial improvement
over a locking cache when the hit rate is not very high. This
fact can benefit systems in three particular circumstances.


1. Any system with a low hit rate benefits from lockup-free 
caches consistently. In our sample architecture, two ben- 
efits are derived from the fact that the cache block size is 
one word. Namely, the bus traffic is minimized and write 
operations always hit at the cache. The drawback of the 
small block size is, however, a lowered hit rate. The system 
can accommodate this lower hit rate because a lockup-free 
cache is used. 


2. Context switches always cause very low transitory hit rates.
A lockup-free cache can help “smooth” the performance
dip in the system behavior after such context switches.


3. A cache which usually exhibits a good hit rate may be
sensitive to particular “pathological” workloads which can
lower the hit rate for particular applications or, especially,
for operating system calls. As in the case of context
switches, the cache can adapt to such workload changes if
it is lockup-free.


It is interesting to note that a system with single word blocks,
while likely to lower the hit rate, also experiences a favorable
shift in the distribution of misses. Misses are much more likely
to appear in bursts, since an initial sequential access to an array
or other data structure will cause a miss for every access. This
is also the case when double- or quad-words are accessed. A
system with lockup-free caches will exhibit better performance
if misses are clustered together than if they are homogeneously
distributed. This characteristic can be taken advantage of if the
compiler can generate load instructions for data to be used in
the future, ahead of time. In this case the “blind” prefetching
associated with larger block sizes has been replaced with selective
“smart” prefetching under compiler control.


6.3 Dependency Effects 

Figure 6 shows the MIPS ratio of the infinite buffer system
and the locking cache system as a function of $l$. For this case,
$T_m^0 = 20$ and $d = 1.5$. As is to be expected, the performance
ratio increases linearly until $l/d = T_m^0$, up to a point where the
lockup-free cache system operates at peak speed, and then remains
constant. In the case of a hit rate $h = 0.8$, the maximum
performance improvement over the locking cache system is 700%.
Since the inter-dependency distance $l$ affects the performance
of the system greatly, techniques on how to increase $l$ are
important.

(1) The compiler can attempt to increase $l$ within the
instruction stream by reordering instructions in a more favorable
way.

(2) All load instructions should be non-blocking, as they are
implemented in the IBM RT processor [16].

(3) The compiler can generate special instructions which
explicitly load data into cache, long before they are needed. Such
non-blocking load operations enable the compiler to control
selective prefetching of data. These loads should be generated in
bursts.

(4) For instruction access misses not to cause blocking, several
consecutive instructions can be prefetched ahead of instruction
decode time. Only branch instructions will cause blocking in
this case. If branches are delayed as in RISC processors, $l$ can
be increased.

(5) Some architectures can naturally exhibit a high value for
$l$. For example, if a vector processor is attached to the system,
long strings of vector register load instructions are likely to be
executed frequently.
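
The dependence of the Figure 6 ratio on $l$, which motivates the techniques listed above, can be reproduced from the model. The sketch below assumes that a locking cache pays the full memory access time on every miss, giving a slowdown of $1/(1 + (1-h)\,d\,T_m^0)$; that locking-cache expression and the function name are assumptions made for illustration.

```python
def mips_ratio(h: float, l: float, d: float, tm0: float) -> float:
    """MIPS ratio: infinite-MSHR lockup-free cache over a locking cache."""
    locking = 1.0 / (1 + (1 - h) * d * tm0)    # assumed locking-cache model
    if l / d >= tm0:
        lockup_free = 1.0                       # peak speed, total overlap
    else:
        lockup_free = (1 + l * (1 - h)) / (1 + tm0 * d * (1 - h))
    return lockup_free / locking


if __name__ == "__main__":
    # Parameters quoted for Figure 6: Tm0 = 20, d = 1.5, h = 0.8.
    for l in range(0, 41, 5):
        print(l, round(mips_ratio(0.8, l, 1.5, 20.0), 3))
```

Under these assumptions the ratio equals $1 + l(1-h)$ below the break point, so it grows linearly in $l$ and levels off at 7 once $l/d$ reaches $T_m^0$, which appears to correspond to the 700% figure quoted for $h = 0.8$.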


7. CONCLUSION 
In this paper we have shown how a multiprocessor can be con- 
figured with lockup-free caches. Vital to such a system are three 
key concepts: 

1. The correctness of the system. 

2. The efficiency of the interconnect. 

3. The efficiency of the cache architecture. 


We have shown that the processors of a multiprocessor system
may resolve multiple misses concurrently if the system is
weakly ordered. Weakly ordered multiprocessors require that
shared writable data are modified exclusively from within critical
sections. This restriction can be enforced by the compiler.

The interconnect we propose consists of packet-switched 
buses. By using a cache block size of one word, the bus traffic 
is minimized. Hence, more processors can be connected to the 
buses and contention is lower. The fact that a one word block 
size decreases the cache hit rate is overcome by the fact that 
the caches are lockup-free. 

We have shown that in a weakly ordered system, maintaining
cache coherence is very simple when the cache block size is
one word. The assumption of a one word cache block size
eliminates all cache write misses and minimizes bus traffic, since
prefetching is controlled by the compiler and is not done blindly.

Overall, we believe that bus-based multiprocessors with 
lockup-free caches are both viable and useful. One of the most 
interesting features of such a system is the adaptability to the 
cache miss rate it exhibits. When hit rates are high, the im- 
provement due to overlapping misses is low. However, when 
the hit rate declines, the efficiency of the lockup-free caches im- 
proves rapidly. This characteristic makes lockup-free caches a 
particularly appealing feature for systems with a wide variety
of workloads, and hence varying hit rates.


POET: A Tool for the Analysis of the Performance of Parallel Algorithms.


Anselmo A. Lastra and C. Frank Starmer 
Departments of Computer Science and Medicine 


Duke University 
Durham, NC 27710 


Abstract 


A tool to aid in the analysis of the execution time of 
parallel algorithms is presented. The tool consists of a 
simple language for describing the algorithms and an 
interpreter that determines the execution time on a 
given number of processors. 


The key concept is the separate specification of local
computation and remote memory access in the algorithm
description. This allows the interpreter to simulate the
communications of the target parallel machine. An accurate
simulation of the memory access delays coupled with the
specified amount of local computation results in the predicted
parallel execution time. Operation of the system has been
validated by comparing predicted versus observed execution
times for numerical algorithms on the Butterfly and Butterfly
Plus Parallel Processors.


Introduction 


The exploration of the time performance of algorithms is
ubiquitous in computer science. Practitioners range from the
novice programmer deciding upon a sorting algorithm to the
computer scientist investigating the theoretical complexity of
algorithms. The tool described in the paper is aimed at
an investigator between these two extremes. It is designed for
the practical and speedy performance analysis of algorithms
for shared memory parallel architectures.


There have been parallel performance analysis tools
described in the literature. In particular, those for
analyzing the speedup of FORTRAN code®, and as part
of a program development environment?, as well as
theoretical® and probabilistic? models of parallel
performance. These, however, don't address the needs
of someone trying to decide on a particular algorithm
or investigating the suitability of a particular
architecture. Presumably at that stage programs have
not been written and may never be. What is needed is
a tool that analyzes performance given a simple
description of an algorithm. This paper describes
such a tool. POET (Prediction of Execution Times) is a
system to facilitate execution time analysis of the
parallel implementations of algorithms. POET consists
of a language for specifying algorithms and an
interpreter for the language that predicts the time
performance of the algorithm on an MIMD shared
memory parallel machine.


Supported in part by grant HL-32994 from the National Institutes of
Health and contract ONR-4414804 from the Office of Naval Research.


On a single processor, a good way to predict the
execution time of a section of code is to "count
operations". In other words, combine an analysis of
the flow of control with an estimate of the run time of
small sections of code, such as inner loops or
individual operations. This yields an analytical
expression, or perhaps just a value, for the execution
time of the algorithm.


This method fails on a parallel machine because of
interactions between processors. On a shared memory
machine the execution time of code on a particular
processor is affected not only by memory access to
remote memories, but also, on some architectures, by
accesses of the local memory by other processors.
Other memory related phenomena that affect
execution time are cache utilization and bus saturation.


Since the factors that perturb execution time are
memory related, it is reasonable to assume that by
separating purely local computation from remote
memory access, one could use a modified version of
the method that worked for the uniprocessor. This is
what POET accomplishes. The algorithm specification
language allows for the separate specification of local
computations from remote fetches and stores. The
interpreter then uses this information to simulate the
memory transfer behavior of the target parallel
machine. The estimated times of local computations
combined with the delays incurred by the memory
access patterns then constitute the total run time.


Currently simulation modules for the Butterfly¹ and
Butterfly Plus² parallel processors have been
implemented and validated. We believe that
implementations simulating other shared memory
MIMD machines, such as the RP3⁴, will require
modifications of only one module.


The Language 

To determine flow of control, the simulator interprets
a language with a syntax similar to that of C. New
statements were added to describe the parallelism and
memory transfers. Only a restricted subset of the data
types and expressions in C were retained, mainly to
keep the implementation of the interpreter more
manageable. An example program is shown in Figure
1. Note that whenever "processor" is mentioned, it
denotes a simulated processor, not one on which POET
is running.


There are two statements for specifying the use of
time by the algorithm, compute and transfer. The
compute statement specifies an amount of purely local
computation. It has one argument which is an
expression that, when evaluated, yields the time that
the computation adds to the clock. For example, for
matrix multiplication an individual computation may
consist of the sum of the execution times of an add and
a multiply. The execution time specified for the
compute statement may be an estimated time,
instruction execution times from the manufacturer of
the hardware, or perhaps time from a benchmark.


The transfer statement represents a store or fetch
operation to a remote memory. There are three
arguments. The first argument specifies the remote
memory which is to be accessed. The second
argument specifies the time that a memory access
adds to the local machine and the third the time
consumed by the remote processor (due to one or
more lost memory cycles perhaps). These times are
variables because one may want to specify different
types of transfers, such as an integer, double
precision floating point, or maybe a block of bytes. If
simulating single memory architectures, the first and
third arguments, the remote memory number and
time penalty on remote processor, are not applicable.
Times for data transfers are obtained from the
manufacturer's specifications or from benchmarks.

    for(i = 0; i < n; i = i + procs)
        for(j = 0; j < n; j++) {
            for(k = 0; k < n; k++) {
                transfer(mem1, FetchDouble, RemoteFetchDouble);
                transfer(mem2, FetchDouble, RemoteFetchDouble);
                compute(DoubleAdd + DoubleMult);
            }
            transfer(mem3, StoreDouble, RemoteStoreDouble);
        }

Figure 1. Simplified code fragment for matrix multiply. The two
matrices of size n are on memories mem1 and mem2 with the result
going to mem3. DoubleAdd and DoubleMult are the execution times
of an add and a multiply, respectively. FetchDouble is the time
consumed by the local processor for fetching a double precision
floating point number while RemoteFetchDouble is the time that
the remote processor is delayed because of the fetch. StoreDouble
and RemoteStoreDouble are analogous times for storing a number.
Procs is the number of processors. This simple model does not
include the overhead of index or pointer calculations.
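As a rough illustration of how the compute and transfer statements of
Figure 1 add up, the sketch below replays the same loop nest in Python
and accumulates only the local clock of one simulated processor. It
ignores contention at the remote memories and the remote penalty times,
so it is a lower bound rather than what POET itself would report. The
function and argument names are assumptions made for this example.

```python
def figure1_local_time(n, procs, fetch, store, add, mul):
    """Local clock of one simulated processor for the Figure 1 loop nest.

    Assumes every transfer completes immediately (no queueing at the
    remote memory), which is a simplification of the POET model.
    """
    clock = 0
    for i in range(0, n, procs):          # rows handled by this processor
        for j in range(n):
            for k in range(n):
                clock += fetch            # transfer(mem1, FetchDouble, ...)
                clock += fetch            # transfer(mem2, FetchDouble, ...)
                clock += add + mul        # compute(DoubleAdd + DoubleMult)
            clock += store                # transfer(mem3, StoreDouble, ...)
    return clock
```

Doubling procs halves the number of rows each processor handles, so in
this contention-free view the local time halves as well.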


The parallel and wait statements control the parallel
execution of the simulator. They are used to begin
simulated execution on the user specified number of
processors, to implement critical regions, to
synchronize, and to return execution to one
processor. The wait statement places the processor
executing it into an idle state until the expression
specified as an argument becomes true. Wait is a
useful construct for synchronizing groups of
processors.


Initial simulated execution is on one processor. The
parallel statement causes the simulator to begin
simulated execution on the number of processors, or
perhaps just different processes, specified as an
argument. Upon reaching the end of the parallel
block, simulated processors are set to an idle state.
When all the processors have completed the block,
processing is continued on processor number 0, the
processor on which original execution began.


All variables in the POET language must be declared.
There are two main classes of variables, local and
shared. An individual copy of each of the local
variables is made for each of the simulated processors
while there exists only one copy of each of the shared
(global) variables.


Other statements in the POET language are identical to
those in C. They include statements for flow of control
such as if and for. The more commonly used
arithmetic and logical operators have been
implemented.

The Simulator 

The simulator loads a program, in the language
described above, that specifies the flow of execution to
be modelled. The program is parsed and converted
into an intermediate language. The intermediate
language consists of a restricted subset of
instructions. For example, the for statement is
broken up into an if statement, two arithmetic
expressions, and a goto statement, in a manner
similar to an assembly language without the word size
restrictions. Having such a restricted set of
instructions made the interpreter easier to design and
modify.
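The lowering of a for statement can be sketched as follows. This is
only an illustration of the idea: the instruction names ("expr",
"if_false_goto", "goto") and the tuple layout are hypothetical and are
not taken from the POET intermediate language.

```python
def lower_for(init, cond, step, body):
    """Lower `for (init; cond; step) body` into a flat instruction list.

    The result holds one conditional branch, two arithmetic expressions
    (init and step) and one goto, wrapped around the body instructions.
    Jump targets are indices into the returned list.
    """
    code = [("expr", init)]
    test_at = len(code)
    code.append(None)                      # placeholder for the loop test
    code.extend(body)
    code.append(("expr", step))
    code.append(("goto", test_at))         # jump back to the loop test
    # Leave the loop (jump past the end) when the condition is false.
    code[test_at] = ("if_false_goto", cond, len(code))
    return code


program = lower_for("k = 0", "k < n", "k = k + 1",
                    [("compute", "DoubleAdd + DoubleMult")])
for index, instruction in enumerate(program):
    print(index, instruction)
```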


The interpreter works on one simulated processor at a
time and continues working on that processor until
an instruction is reached that consumes execution
time or affects another processor. The processor on
which to work is determined by a scheduler described
below. State information is kept for each of the
simulated processors. This includes the contents of all
of the local variables, a program counter, and the
current processor state such as transferring data,
computing, idle, and waiting.


A clock is also kept for each simulated processor. A
processor's clock is incremented when a compute or
transfer instruction is interpreted. The scheduler
uses these clocks to determine which processor to
simulate next. The processor to schedule next is the one
that has the smallest value on its execution clock. If
there is more than one clock with the same minimum
time, they are scheduled for the interpreter in a
round robin fashion. This scheduling algorithm
should yield an execution trace very close to that of
the processors executing in parallel. A global clock is
kept which shows how far the simulation has come.
This clock eventually yields the overall run time.
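The scheduling rule described above (smallest clock first, ties broken
round robin) can be sketched in a few lines. This is a minimal model
under stated assumptions: each simulated processor is reduced to a list
of time increments, and the names pick_next and simulate are invented
for this example.

```python
def pick_next(clocks, runnable, last):
    """Index of the runnable processor with the smallest clock.

    Ties are broken round robin, starting just after `last`.
    Returns None when no processor is runnable.
    """
    candidates = [i for i in range(len(clocks)) if runnable[i]]
    if not candidates:
        return None
    smallest = min(clocks[i] for i in candidates)
    n = len(clocks)
    for offset in range(1, n + 1):
        i = (last + offset) % n
        if runnable[i] and clocks[i] == smallest:
            return i
    return None


def simulate(workloads):
    """Run each processor's list of time increments; return (run time, trace)."""
    clocks = [0] * len(workloads)
    pcs = [0] * len(workloads)             # per-processor program counters
    trace, last = [], -1
    while True:
        runnable = [pcs[i] < len(w) for i, w in enumerate(workloads)]
        p = pick_next(clocks, runnable, last)
        if p is None:
            break
        clocks[p] += workloads[p][pcs[p]]  # advance this processor's clock
        pcs[p] += 1
        trace.append(p)
        last = p
    return max(clocks, default=0), trace
```

The overall run time is taken here as the largest per-processor clock,
which plays the role of the global clock in the text.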


The key to the simulator is the module in the
interpreter that mimics the processor to memory
communication of the target machine. For the
current target machine, the BBN Butterfly, it is
particularly easy to implement. When a processor
requests access to another memory (the memory of
another processor) with a transfer instruction, the
simulator examines the state of the remote processor.
If the remote processor is executing a local
computation (a compute statement) or is idle, the
simulated transfer occurs immediately. The remote
processor is penalized for lost memory cycles and the
transfer time is added to the execution time of the
local processor. Both of these times are specified as
arguments to the transfer instruction.

If the remote processor is busy transferring data, a
retry would occur after the remote processor
completes the transfer or transfers. In the simulator,
instead of executing a retry, a queue is kept of data
requests for each processor. This queue is examined
when the processor completes a data transfer. The
simulated requesting processor blocks until the
transfer completes.
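A minimal sketch of this transfer rule is given below. It is not the
POET implementation: instead of an explicit request queue it keeps, for
each memory, the time at which its current transfer finishes, which
gives first-come first-served service as long as requests are presented
in nondecreasing clock order. The class and attribute names are
assumptions made for this example.

```python
class Machine:
    """Per-processor clocks plus a 'free at' time for each memory."""

    def __init__(self, nprocs):
        self.clock = [0.0] * nprocs        # one clock per simulated processor
        self.mem_free_at = [0.0] * nprocs  # when each memory ends its transfer

    def compute(self, p, t):
        """Purely local computation: only processor p's clock advances."""
        self.clock[p] += t

    def transfer(self, p, mem, local_t, remote_t):
        """Processor p accesses memory `mem`, waiting if it is busy."""
        # Block until the remote memory has finished earlier transfers.
        start = max(self.clock[p], self.mem_free_at[mem])
        self.clock[p] = start + local_t
        self.mem_free_at[mem] = start + local_t
        if mem != p:
            # The owner of the memory loses some memory cycles.
            self.clock[mem] += remote_t


m = Machine(3)
m.transfer(0, 2, 4.0, 1.0)   # memory 2 is free: transfer starts at once
m.transfer(1, 2, 4.0, 1.0)   # memory 2 is busy until t=4: processor 1 waits
print(m.clock)
```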

Note that we do not model individual paths through
the Butterfly switch; we assume that one is available.
This turns out to be a good simplifying assumption
because the switches on the Butterfly do not block
when a transfer fails. Rather, the path is released
and the transfer is retried a short random time in the
future. Another reason that a particular path need
not be simulated is that many Butterfly processor
configurations have alternate paths between nodes.


It is true that a high volume of data transfers would
produce path saturation and cause this
implementation of POET to produce inaccurate
timings. However, when simulating extreme volumes
of interprocessor transfers, high accuracy was not
deemed necessary since the degree of parallelism
would have presumably reached a plateau. This model
trades a slightly limited range for simplicity.


The memory access simulator is the only module which
varies with target architectures. Simulators for some
architectures would be more difficult to implement.
For example, on an architecture with a single memory
and a large cache for each processor, a model of the
steady state distribution of cache hits would have to be
developed. Also a characterization of the delay on the
global bus would be important.

The simulator maintains timing information for
individual processors and can output the results in a
variety of ways. Control is also provided over the
number of processors in a simulation. Optional
tracing is provided as an aid in debugging or to
visualize bottlenecks in the algorithms being studied.


Performance of the BBN Butterfly version of POET was
validated by comparing predicted execution times
generated by POET with measured execution times of
numerical algorithms. The testing was conducted on
both a Butterfly Parallel Processor and the newer,
faster, Butterfly Plus.


The algorithms used are representative of those used
for scientific calculations. They included matrix
manipulation algorithms, and those for the solution of
ordinary differential equations.

Agreement between predicted and measured
execution times was generally excellent. The only
exceptions were algorithms designed to cause the path
saturation effect described above. One memory was
accessed repeatedly by all of the processors. When
this occurred, the simplified Butterfly data transfer
model of this version of POET resulted in optimistic
execution times. For our purposes this was acceptable
since these were artificially poor algorithms designed
for testing POET. If more accurate results were desired
under these conditions, a more complex data transfer
model would be necessary.


Conclusions 


We believe that there are many advantages to this
technique for exploring algorithm performance.
First, it is easy to specify the algorithms. Working
code does not have to be produced before estimates
can be made of the execution time. Many different
algorithms may be tested experimentally in the time it
would take to code and debug only one. Second, the
simulator runs quickly. Even if code is available,
benchmarking it is likely to take much longer than
testing it on POET. Third, the parallel hardware does
not have to be available. One may explore ideas when
the parallel machine is busy, or down, or even while
awaiting delivery.


Another use for POET might be in testing ideas for
parallel machines. By varying the times specified for
computation and for data transfer, one can predict the
increase or decrease in the degree of parallelism for a
given change in the speed of the processor, memory,
or interprocessor switch. With modifications to the
interconnect module of POET, the performance of
experimental network topologies may be studied.




An interesting addition to POET would be a graphical
display of simulated execution. This would aid in
detecting bottlenecks and in understanding the
parallel execution of algorithms. Another, more
difficult, enhancement would be automatic
benchmarking of existing code.
Abstract


The memory latency in a shared memory mul- 
tiprocessor system can be reduced by either the use of 
a high bandwidth interconnection network or the 
incorporation of private cache memories. This paper 
presents the performance analysis of a system that 
employs a high bandwidth multiple-bus network and 
private cache memories. The cache coherence proto- 
col is a modified version of the Write-once protocol 
proposed for single bus architectures, and the
multiple-bus network is asynchronous packet
switched. A queueing network model consisting of
mixed multiple class customers has been developed. 
The model captures the effects of both multiple-bus 
contention and the cache coherence protocol on the 
system performance. To reduce the computational 
complexity of the model, a simplified algorithm based 
on flow equivalence technique has been developed. 
Numerical results obtained from our model show that 
a high bandwidth network such as multiple-bus is 
necessary for a large system because the single bus
gets saturated very rapidly and creates a system
bottleneck.


1. Introduction 


The problem of memory latency has been con- 
sidered as a major obstacle in the evolution of 
shared-memory multiprocessor systems. Extensive 
studies aiming at reducing the memory latency have 
been carried out in the past. There are two basic 
ways of dealing with this memory latency problem: 1) 
design of a more cost-effective interconnection net- 
work that offers high communication bandwidth [2]; 
2) use of private cache memory to reduce memory 
access time and memory bandwidth requirement [10]. 


A great deal of work has been done in design 
and analysis of various interconnection networks such 
as crossbar, multistage interconnection networks, and 
multiple-buses. Compared to crossbar and multistage 
interconnection networks, multiple-bus interconnec- 
tion provides several advantageous features such as 
flexibility, expandability and fault tolerance [3, 8,
16]. One can configure the multiprocessor system in
a variety of ways to provide a range of trade-offs 
among bandwidth, connection cost and reliability. 
However, as was pointed out by Winsor and Mudge 
[15], the cache based multiple-bus multiprocessor may 
suffer from difficult synchronization problems. Hence, 
asynchronous multiple-bus systems seem to be more 
attractive for large size cache based multiprocessors. 


*This research is supported by NSF Grant #DMC-8513041
and a grant from Louisiana Board of Regents.

In [17], queueing network models have been developed
to analyze the performance of asynchronous, packet- 
switched multiple-bus system with buffers. It has 
been shown [17] that a packet switched multiple-bus 
system provides high bandwidth and flexibility. None 
of the previous studies on multiple-bus system takes 
into account the incorporation of private cache 
memories, except for [5] where some theoretical 
bounds for the system throughput are developed. 
The use of private cache memories in a multiproces- 
sor system introduces the complex cache coherence 
problem [4, 10] because multiple copies of a memory 
block may reside in different caches at any given 
time. Modification of any copy of a shared memory 
block by a processor in its local cache may cause an 
obsolete value of the shared data in the main memory 
and other caches that are currently having a copy of 
this block. The avoidance of this cache inconsistency 
problem is vital to the correct operation of the 
shared-memory multiprocessors. 


A number of cache coherence protocols have 
been proposed in the literature recently, which can be 
broadly classified into two groups. The first group 
employs a centralized global directory scheme while 
allowing a general interconnection network to be used 
[13]. The second group allows the cache consistency 
to be maintained in a decentralized manner but limits
the interconnection to a single shared bus
[1, 6]. The write-once protocol, proposed by Goodman
[6], was the first distributed protocol to appear
that provides a good compromise between write-
through and write-back protocols which are used in
commercial machines. However, both of these two
classes of protocols have potential problems. The
central directory protocols allow a general IN to be 
used but the performance of the system is highly 
dependent on the central directory. The distributed 
protocol, on the other hand, tries to remove this 
bottleneck, but instead it is passed from the central 
directory to the single shared bus. As has been 
shown in [1] and [14], the single bus creates system 
bottleneck when the number of processors exceeds 10. 
Hence, it seems to be necessary to develop cache 
coherence protocols that allow high bandwidth inter- 
connection network to be used while keeping the 
advantage of distributed control. In this paper, we 
consider a cache coherence protocol based on
Goodman’s Write-once protocol that can be applied 
to high bandwidth packet switched multiple-bus sys- 
tems [17]. 

Archibald and Baer have presented a comprehen- 
sive performance comparison of various single bus 
protocols by means of simulation [1]. An analytical 
model for the single bus protocol based on write-once 


as well as its variants has been reported by Vernon 
and Holliday in their paper [14]. Their model is exact 
and is based on Generalized Timed Petri Nets tech- 
nique. However, it can not be easily extended to 
large system sizes because of the complexity. In this 
paper, a queueing network for cache based asynchro- 
nous multiple-bus multiprocessors will be developed 
that consists of both open and closed customer classes 


[7]. The effects of both multiple-bus contention and


the cache protocol on the system performance will be 
studied. The model can be solved by using standard 
MVA algorithm and flow equivalence technique [7] to
obtain performance values for a variety of system 
parameters with reasonable computational cost. 


In the following section, we will give a brief 
description of the system organization and the cache 
protocol for the proposed system. The assumptions 
that are used in our analysis and the queueing net- 
work model of the cache based multiple-bus multipro- 
cessor are presented in Section 3. Section 4 discusses 
the performance results and Section 5 presents the 
conclusions. 


2. The System Organization and Operational 
Characteristics. 


Fig.1 illustrates the cache-based multiple-bus 
multiprocessor configuration. Associated with each 
processor is a private cache memory through which 
all memory accesses pass. A set of B packet
switched buses connects all the N caches with the shared
main memory, which is divided into M inter-
leaved modules. The communication between the
caches and between a cache and the main memory is 
performed through system buses. A requesting cache 
(called master) which issues a memory access request 
releases the bus immediately after the request packet 
containing the memory address, master id, desired
operation (read/write), etc. has been sent to
the slave device. The released bus can be used for 
other purpose while the desired operation in the slave 
(memory controller or cache controller) is in progress. 
After the operation is finished, the slave device acts 
as a pseudomaster to packetize the response data and 
sends it back to the requesting cache through a sys- 
tem bus. As a result of this, the system buses can be 
well utilized and high system throughput can be 
expected. However, buffers that temporarily hold 
incoming and outgoing packets from each device are 
necessary. The packet transmission can be done on 
any one of the system buses as determined by the 
arbiter. The cache and memory controllers must be 
able to receive more than one packet simultaneously
from the buses, otherwise a packet may be lost. 


The timing of the system buses is asynchronous 
in the sense that there is no centralized global clock 
that distributes clock signals to all the devices in the 
system. The data transfer on a bus is done by means 
of interlock handshaking. However, the internal cycle 
times of all processors and caches are assumed to be 
the same and this cycle time constitutes the basic 
time unit in our following analysis. In [18] we have
presented a somewhat detailed design of the internal 
organization of each private cache as well as the 
cache protocol for synchronous packet switched bus 
systems. It has been shown in [18] that the proposed 
system satisfies the sequential consistency require- 
ment [12]. For asynchronous buses, concerned in this 
paper, the design of cache organization and protocol 
should be similar except for different implementations 
at the circuit level. For the purpose of completeness,
a brief review of the cache protocol follows. 


The state of a memory block viewed by a partic-
ular cache can be one of the following: 0) not
present, 1) valid, 2) written-once, 3) dirty, and 4)
invalid (see Figure 2). When a processor read results
in a cache hit, the cache controller will supply the 
requested word to the processor without changing the 
state of the cache block containing the word. On a 
read miss, the cache controller locates the place where 
the requested memory block resides, flushes a cache 
block to make room for the incoming block, and loads 
the block through one of the system buses. The 
memory block is read from one of the main memory 
modules provided that none of the caches has a dirty 
copy of the memory block. Otherwise, the memory 
block is loaded from the cache that has a correspond- 
ing cache block with dirty state. Once the block is 
loaded, the state is set valid. When the cache con- 
troller receives a write request from the processor, it 
first checks the state of the cache block into which 
the write is to be performed. If the state of the block 
is dirty or written-once, the write can proceed 
without any delay except for setting the state to be 
dirty. If the state of the block is valid, the cache 
controller has to acquire one of the system buses to 
broadcast an invalidation signal to other caches and 
write through the word to be modified into main 
memory before the write operation can be per- 
formed. Upon a write miss, the cache controller per- 
forms the same operation as for a read miss except 
that the requested memory block is loaded with dirty 
state and other copies in the system are invalidated. 


Figure 1. A cache based multiple-bus multiprocessor.

There are B snooping controllers associated with
each private cache. Each of the B snooping
controllers monitors one of the system buses for reads
and writes from other caches. There are basically four
types of bus transactions: Shared Read (SR), Dirty
Read (DR), Write Invalidation (WI), and Write Back
(WB). An SR transaction is due to a read miss
request issued by a cache. Upon a write miss, a cache
controller will generate a DR request (since the
requested block is loaded with dirty state). A WI
transaction is started by a cache that serves its
processor's write request into a valid cache block. If
a block to be flushed (in order to make room for an
incoming block) is in dirty state, the write back of
this block is necessary. The WB transaction is
caused by a cache that writes back a replaced block.
If a snooping controller detects an SR or DR, it first
determines whether its own cache has a dirty copy of
the block requested on the bus. If so, the snooping
controller must inhibit the main memory from
responding to the bus read and provide the data to
the requesting cache. The local copy of the block is
changed to invalid in case of DR. In case of SR, the
block is written back to the main memory and the
state is changed to valid. If the local copy of the
block is in a state other than dirty, the snooping
controller invalidates the copy upon a DR request and
sets state to valid upon an SR request. The snooping
controller invalidates the corresponding cache block
in its local cache when a WI bus transaction is
detected. During each of these cache operations
requested by snooping controllers, requests from the
processor are suspended.
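The processor-side and snooping-side rules described in this section can
be summarized as two small transition functions. This is a sketch, not
the authors' design. Two points are assumptions rather than statements
from the text: that a write into a valid block leaves it in the
written-once state (as in Goodman's original write-once protocol), and
that a snooped WB, or any transaction for a block the cache does not
hold, leaves the local state unchanged.

```python
NOT_PRESENT, VALID, WRITTEN_ONCE, DIRTY, INVALID = range(5)


def processor_access(state, op):
    """Return (new_state, bus_transaction or None) for a local read/write."""
    hit = state in (VALID, WRITTEN_ONCE, DIRTY)
    if op == "read":
        # A read hit changes nothing; a read miss loads the block as valid.
        return (state, None) if hit else (VALID, "SR")
    if op == "write":
        if state in (DIRTY, WRITTEN_ONCE):
            return DIRTY, None             # no bus activity needed
        if state == VALID:
            return WRITTEN_ONCE, "WI"      # assumed resulting state
        return DIRTY, "DR"                 # write miss loads block as dirty
    raise ValueError(op)


def snoop(state, txn):
    """Return (new_state, supplies_data) when `txn` is seen on a bus."""
    if state in (NOT_PRESENT, INVALID) or txn == "WB":
        return state, False                # assumed: nothing to do
    if txn == "WI":
        return INVALID, False
    supplies = state == DIRTY              # dirty owner answers for memory
    if txn == "DR":
        return INVALID, supplies
    if txn == "SR":
        return VALID, supplies             # dirty copy is also written back
    raise ValueError(txn)
```

For example, a cache holding a dirty copy that observes an SR supplies
the data and keeps a valid copy, while the same cache observing a DR
supplies the data and invalidates its copy.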


Due to the multiple buses, directly applying the 
above protocol to the multiple-bus system may result 
in race and hazard conditions. In [18], we have 
defined 5 types of hazard conditions, namely DR-SR 
hazard, DR-DR hazard, DR-WI hazard, SR-WI 
hazard and WI-WI hazard. An occurrence of any 
hazard condition described above may cause a pro- 
gram error. These problems can be resolved by using 
the same technique as described in [18]. 


3. Queueing Analysis 


3.1. Assumptions and The Work Load Model 


In the analysis presented in this section, we shall 
assume homogeneity of processors, memory modules 
and buses. That is, all processors in the system are 
considered to be behaviorally identical, so are the 
memory modules and buses. After an amount of 
internal computation, called thinking time, a proces-
sor generates a memory request which is sent to its
private cache. The thinking time of each processor is
a random variable, depending on the type of the 
instruction, and is assumed to be arbitrarily distri- 
buted with a mean of Z cycles. Memory access time 
and the transfer time of a data block on a bus are 
assumed to be fixed at T and #, cycles, respectively. 
The cycle time of a cache memory is assumed to be a
constant value which is the same as a processor cycle.
These assumptions are reasonable because for a given 
design of a machine, the basic data unit that is 
transferred between the main memory and caches 
(such as a block) or between a processor and its local 
cache (such as a word) is fixed in size. However, the 
model developed in this paper can be easily extended 
to deal with generally distributed random variables of 
memory access times and bus transfer times. 


The work load model selected for our analysis is 
similar to the one developed by Dubois and Briggs [4]. 
The memory reference stream generated by a proces- 
sor is the merging of two streams: one for private 
memory blocks that can be accessed only by one pro- 
cessor and the other for shared blocks which can be 
read or written by any processor in the system. 
There are totally N,, shared memory blocks in the 
system and each private cache is assumed to be capa- 
ble of holding S, cache blocks(shared or private). 

Figure 2. Markov state diagram for a memory block.


The probability that a memory request issued by a 
processor addresses one of the N,, shared blocks is 
represented by q_s, and the probability of addressing a
private block is 1-q_s. Similarly, a memory request is
a read with probability f_r and a write with probabil-
ity f_w. It is also assumed that a shared memory
request addresses any one of N,, blocks equally 
likely, i.e. a uniform reference model for shared 
blocks is assumed. If a request is for a private block, 
it is a cache hit with probability h and miss with 
probability 1-h. When a cache miss occurs, the 
cache controller randomly selects one cache block to 
be replaced to make room for the incoming block. In 
most of the existing cache systems, LRU (Least 
Recently Used) replacement policy is primarily 
employed. The random replacement policy is 
assumed here to simplify our analysis. One should
note, however, that the random replacement algo-
rithm is used in practice, for example in the PDP-11/70. With this
replacement algorithm, the probability that a shared 
block is selected to replace is equal to the percentage 
of shared blocks in the cache at the time when the 
miss occurs. If the selected block to replace is 
private, it is modified and needs to be written back 
with probability md. Hence, the probability that it
is not modified and no action is needed is 1-md.
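The workload model above can be exercised with a small sampler. This is
an illustrative sketch only. The parameter names q_s (probability that a
request addresses a shared block) and f_r (probability that a request is
a read) are labels chosen for this example, and the write probability is
assumed to be 1 - f_r. The hit ratio h applies to private blocks only,
as in the text, so no hit flag is drawn for shared requests.

```python
import random


def sample_request(rng, q_s, f_r, h):
    """Draw one memory request according to the workload parameters."""
    shared = rng.random() < q_s
    op = "read" if rng.random() < f_r else "write"
    request = {"shared": shared, "op": op}
    if not shared:
        # Private blocks hit in the cache with probability h.
        request["hit"] = rng.random() < h
    return request


rng = random.Random(1)
requests = [sample_request(rng, q_s=0.1, f_r=0.7, h=0.95) for _ in range(10000)]
shared_fraction = sum(r["shared"] for r in requests) / len(requests)
print(round(shared_fraction, 2))
```

With a large sample the printed fraction of shared requests should be
close to the chosen q_s.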


Consider a particular processor in our multipro- 
cessor system. It can be in one of two states: busy or 
idle. A processor is said to be busy when it is doing 
some internal computation and the processor is idle 
after it issues a memory request until the request is 
satisfied. If the memory request results in a cache hit, 
the processor becomes busy again immediately after
the cache operation is complete. Additional delay
may be incurred if one of the following is true: 1)
The request results in a cache miss so that a bus 
transaction is needed; 2) The request is a write that 
modifies a shared and clean cache block. In this case, 
the processor can not perform the write immediately 


until other copies of the memory block are invali- 
dated; 3) The cache controller is busy serving requests 
from snooping controllers that observed bus transac- 
tions which require the cache controller to invalidate 
a certain cache block or to supply a cache block on to
the buses. The proportion of the time that the pro- 
cessor remains busy is defined to be the processor 
utilization. Since the system throughput is directly 
related to this processor utilization, we shall use the 
processor utilization and processing power (the pro-
duct of processor utilization and number of processors)
as performance metrics in our analysis.
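The two metrics can be written down directly. The sketch below assumes
that each busy period of mean length Z (the thinking time) is followed
by exactly one memory request with some mean idle time, so that
utilization is the busy share of that cycle. The function and argument
names are chosen for this example.

```python
def processor_utilization(think_time, mean_idle_time):
    """Fraction of time a processor is busy in one think/request cycle."""
    return think_time / (think_time + mean_idle_time)


def processing_power(n_processors, utilization):
    """Product of processor utilization and number of processors."""
    return n_processors * utilization


u = processor_utilization(think_time=3.0, mean_idle_time=1.0)
print(u, processing_power(8, u))
```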


3.2. The Model 


The cache based multiple-bus multiprocessor 
described above is modeled by a mixed queueing net-
work that consists of both closed and open customer
classes [7], as shown in Figure 3. The processors are
modeled by delay servers labeled P_i. The memory
modules, M_1, M_2, ..., M_M, are represented by FCFS
service centers with service rate 1/T. The N private
caches C_i are also modeled by FCFS service centers
with service rate of 1. The possible "customers" in a
private cache queue are the requests from its local
processor and the requests from other processors for
invalidation and supplying data, etc. The bus sys-
tem is represented by a flow equivalence service
center [7] as shown in the figure. In case of central- 
ized control, the bus system queue is simply a load 
dependent service center. There are a total of N dis- 
tinct customers in the network that belong to closed 
class, each of which is associated with a specific pro- 
cessor. The routing of each of these closed class cus- 
tomers can be described as follows. 


Initially a customer stays in the delay center 
associated with it for a random amount of time to 
represent the thinking period of the processor. The 
processor then requests a cache service (read or write) 
and the "customer" is threaded to the cache queue.
If the request can be satisfied within the private
cache, the "customer" goes back to the delay center
and the processor becomes busy again. If the cache
request results in a cache miss or a cache hit but 
requiring invalidation signal to be sent, then a bus 
transaction is necessary. In this case, the request will 
join the bus system queue in which a free bus is to be 
obtained. Once the processor gets a bus, the request 
packet will be transmitted to different places in the 
system depending on the nature of the request. 
According to the write-once protocol, the requested 
memory block can be acquired either from one of the 
memory modules or from one of the remaining 
caches. The block is supplied by a memory module 
only when none of the remote caches has a dirty copy 
of the requested block. In this case, the request
packet joins one of the memory queues (equally
likely). Once the request packet reaches the head of
the queue, the required memory operation takes 
place. After the operation is finished, the memory 
controller formats a response packet that contains 
both the requested data and the identification of the 
requesting processor. The response packet is then 
returned to the requesting cache through one of the 
system buses and the particular data item requested 
by the processor goes directly to the processor. Similarly,
if the missed block is to be supplied by a
remote cache, instead of joining a memory queue the
request packet joins the corresponding cache queue,
from where the cache controller supplies the data
through the system buses. It is clear that the exact
routing behavior of a customer depends on a set of
routing probabilities. For instance, upon emerging
from a cache queue, the customer may go back to its
processor with probability $R_{cp}$ and go to the bus queue
with probability $R_{cb}$. Determination of these probabilities
requires a careful analysis of the stochastic sharing
behavior of the system. This is the task of the next
subsection.


Readers may have already noticed that in our 
cache coherence protocol there are cases that a 
memory request may spawn into a number of requests 
that go to different places in the network. For exam- 
ple, when a processor writes into a valid cache block, 
both write-through into main memory and invalida- 
tion signals to other caches are to be sent. Also when 
a block being fetched upon a miss arrives at a proces- 
sor, the requested word in the block goes to the 
requesting processor directly and the block needs to 
join the cache queue to be written into cache 
memory. As another example, if a cache observes a
DR or SR request on a bus and it has a written-once
copy of the requested block, then appropriate state
changes are necessary. In all these cases, more than
one customer is generated by one customer.
This phenomenon is called customer forking in the
queueing network and is very difficult, if not impossible,
to analyze [7]. However, a careful examination of
the cache protocol shows that the only effect of an
invalidation signal on the system performance is its
contribution to the load of the cache queue in which
the invalidation is to be done. Similarly, the
effects of write-back and write-through into the main
memory are contributions to the traffic load on the
buses and memories. Therefore, to capture these
effects we shall use open class customers by adding
sources and sinks as shown in Figure 3. The sources
$SC_i$ and sinks $SK_i$ are used to represent the effect
of invalidations, block loading and state changing on
the caches, and $SC_b$-$SK_b$ and $SC_m$-$SK_m$ are used
to represent the effect of write-back and write-through
on the buses and memories, respectively.
The arrival and departure rates of these sources and
sinks will be discussed shortly.


Figure 3. The queueing network model for the cache
based multiple-bus multiprocessor system.


3.3. Routing Probabilities


Recall that a memory block can be in one of five
different states as seen by a processor: not-present (0),
valid-and-clean (1), once-written (2), dirty (3) and
invalid (4). The Markov state diagram that represents
these states is shown in Figure 2. Let $\pi_i$ denote the
steady state probability of a memory block being in
state $i$. Note that the state of a memory block can
be changed by a request either from the processor or 
from a system bus. Let p,; be the probability that 
the particular memory block 7 is addressed by a 
request generated by the local processor. The state 
transition rate of a shared block 1, from state 0 to 
state 1 is q,p;f,P,P,/Z and to state 3 is 
q:P; fyP,P,/Z due to the local processor read and 
write respectively, where P, is the probability that a 
processor can get a bus at a cache cycle. Bus 
requests, on the other hand, are mainly the effects 
from remote processors. For example, if the local 
cache observes an SR request on a bus and it has a 
dirty copy of the requested block, this SR request 
brings the block from the original dirty state to 
valid-and-clean state (i.e., from 3 to 1). This is because
the local cache supplies the block to the requesting
cache, writes it back to the main memory and
changes the block state to valid-and-clean. As
another example, the state of a block changes from
valid-and-clean (1) to invalid (4) upon receiving a WI
for that block on a bus. Let $B_{sr,i}$, $B_{dr,i}$, and $B_{wi,i}$
represent the probabilities that there is a bus request
on a bus for block $i$ that is of type SR, DR, and WI,
respectively. Then we get a set of transition rates of
the Markov process as marked in Figure 2.
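
As a rough numerical illustration of how steady state probabilities of such a five-state block model could be obtained, the sketch below applies power iteration to a row-stochastic transition matrix. The matrix entries are hypothetical placeholders chosen only so that the chain is aperiodic and irreducible; they are not the transition rates of Figure 2, and the function name `steady_state` is an assumption of this sketch.

```python
def steady_state(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of pi <- pi * P.
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical per-cycle transition probabilities (NOT taken from Figure 2).
# States: 0 not-present, 1 valid-and-clean, 2 once-written, 3 dirty, 4 invalid.
P = [
    [0.90, 0.07, 0.00, 0.03, 0.00],
    [0.01, 0.89, 0.05, 0.00, 0.05],
    [0.01, 0.04, 0.89, 0.03, 0.03],
    [0.01, 0.04, 0.00, 0.92, 0.03],
    [0.01, 0.07, 0.00, 0.03, 0.89],
]
pi = steady_state(P)
```

The resulting vector satisfies the balance equations up to the chosen tolerance; closed-form expressions, when available, are preferable because they expose the dependence on the workload parameters.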


It should be pointed out that our Markov state
diagram represents a discrete-time Markov process based on
cache cycles. However, the state transition between
state 0 and other states may take more than one
cycle due to bus contention. This effect is taken
care of by incorporating P,’s into the transition rate. 
Since our purpose at this moment is to derive the 
steady state probabilities of a memory block being in 
each state, we will momentarily set P, to be 1. The 
bus contention will be modeled in the higher level queueing
model. Obviously, the Markov process shown in
Figure 2 is aperiodic and irreducible. Solving the
balance equations yields the following expressions for
the $\pi_i$'s.
where $p_{i,j}$ is the transition rate from state $i$ to state
j as shown in Figure 2. And 0, 6, a and ¥ are given 

Having obtained the values of the $\pi_i$'s, we are now
ready to derive the routing probabilities of the queueing
network of Figure 3. Let us consider $R_{cp}$, the probability
that a customer will go back to the processor
after it comes out of the cache queue. $R_{cp}$ is also
the proportion of cache requests that can be successfully
served by the local cache. Clearly, it is given by


$$R_{cp} = (1 - q_s)\,h + q_s f_r (\pi_1 + \pi_2 + \pi_3) + q_s f_w (\pi_2 + \pi_3).$$


The proportion of cache requests that need to access
the bus system, $R_{cb}$, is simply $1 - R_{cp}$. A memory request
on a bus can be served by either a cache or a memory
module as described previously. Hence, a customer
coming out of the bus system may get into a service center
representing the memory modules or go to one of the remaining
cache queues. The probability of going to a cache
queue, $R_{bc}$, is given by
and the probability of going to one of the memory
modules is given by $R_{bm} = 1 - R_{bc}$. It is assumed in
our analysis that once a request goes to the memory
queues, it joins any one of the M memory queues
with equal probability. Similarly, if the request goes to the cache
queues, it enters any one of the N-1 cache queues with
the same probability.
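
As a small worked example of the cache-to-processor routing probability, the sketch below evaluates $R_{cp} = (1 - q_s)h + q_s f_r(\pi_1 + \pi_2 + \pi_3) + q_s f_w(\pi_2 + \pi_3)$ and its complement. The read and write fractions (0.7 and 0.3) are the values used later in the paper; the values of `q_s`, `h`, and the state probabilities are hypothetical, and the function name is an assumption of this sketch.

```python
def cache_routing(q_s, h, f_r, f_w, pi):
    """Return (R_cp, R_cb) for one cache queue.

    q_s : probability that a request addresses a shared block
    h   : the factor h multiplying (1 - q_s) in the R_cp expression
    f_r, f_w : fractions of reads and writes
    pi  : steady state probabilities of states 0..4
    """
    served_read = pi[1] + pi[2] + pi[3]   # valid-and-clean, once-written, dirty
    served_write = pi[2] + pi[3]          # once-written, dirty
    r_cp = (1.0 - q_s) * h + q_s * (f_r * served_read + f_w * served_write)
    return r_cp, 1.0 - r_cp


# Hypothetical inputs, for illustration only.
pi_example = [0.1, 0.5, 0.1, 0.2, 0.1]
r_cp, r_cb = cache_routing(q_s=0.1, h=0.95, f_r=0.7, f_w=0.3, pi=pi_example)
```

With these inputs roughly 92% of the requests leaving the cache queue return to the processor and the rest go to the bus queue.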


The behavior of the open customers is deter- 
mined by the arrival and departure rates. In order 
for the system to be stable, the arrival and departure 
rates for each open class must be equal. Let R,,, 
R,,, and R,, be the arrival rates to the system from 
source SC;’s, SC, and SC,,, respectively. They are 
given by 


3.4. Solution of the Model


In the queueing network model defined above,
each of the N closed class customers has its own routing
behavior, which differs from that of the others. For example, a
customer originating from delay server $P_i$ will never
visit the delay server labeled $P_j$ for $i \neq j$. Thus, it
is necessary to use a multiple class solution technique
[7] to solve the model. Moreover, the bus system
queue in the model has a load dependent service rate.
Therefore, we end up with a mixed, multiple class
queueing network containing a load dependent service
center. The exact solution of such a model can be
obtained by applying the exact multiple class MVA algorithm [7],
provided that all the service times of the FCFS
servers are exponentially distributed. In our case,
where the memory access time, bus transfer delay and
cache cycles are constant values, the heuristic algorithm
proposed by Reiser [11] can be used. However,
the algorithms described above require a large amount
of CPU time and memory space. In fact, the time
and space complexities of the algorithm are $O(2^N)$,
which is an exponential function of the number of
processors. In order for our model to be practically
useful, we shall develop a less complex algorithm that
can be used to estimate the system performance for a
variety of system parameters.


We begin by defining an aggregated queue that
consists of N-1 processors and cache queues as shown
in Figure 4. This queueing network is obtained by
shorting out one processor-cache system, the bus, and
the memory system queues. The queueing network of
Figure 4 will be solved in isolation for each feasible
population. The solutions of this model will then be
used to solve the high level queueing model shown in
Figure 5. In this high level queueing model, the
entire queueing network of Figure 4 is considered as
one single flow equivalence center. The processor-cache
system ($P_1$ and $C_1$) in Figure 5 is then used to
capture the detailed behavior of the processor and cache
operations, while the remaining N-1 processors and caches
are considered as a flow equivalence queue that captures
only their aggregated effects on the entire
queueing network.


Figure 4. The queueing network representing the
N-1 processor-cache system.


In solving the low level queueing model of Figure
4, we still use the heuristic multiple class MVA [11],
but the simplified approximate version [7], because the
model is much simpler than the original model. The
solution outputs of this model that are relevant to
the high level model are the service rates of the class 1
customer from $P_1$ and the sum of the
throughputs of the remaining N-1 classes of customers
for each possible placement of customers in the
queueing network of Figure 4. Once we have obtained
these service rates, the high level queueing model of
Figure 5 is solved with 2 classes. Class 1 consists of 1
customer from $P_1$ and class 2 consists of N-1 customers
in the flow equivalence center. As a result, the
time and space complexity can be reduced to approxi- 
mately O((1+N)*) instead of O(2" ) provided that 
the low level solutions are available. The complexity 
of solving the low level model of Figure 4 is hard to quantify
due to the iterative nature of the simplified algorithm.
However, it is known that the number of
operations per iteration is $O((N-1)n)$ for a population
of $n$, and empirically fewer than two dozen
iterations are typically required for convergence to
less than a 0.1% change in queue length.
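
To give a feel for what an iterative approximate MVA looks like, the following is a minimal sketch of the well-known Schweitzer-style fixed-point approximation for a closed multiclass network. It is offered under the assumption that the simplified algorithm of [7] is of this general kind; it is not claimed to be the exact procedure used here, it ignores load dependent centers and open classes, and all names and parameter values are hypothetical. The stopping rule uses an absolute change in queue length rather than a percentage.

```python
def approximate_mva(demands, think, population, is_delay=None,
                    tol=1e-3, max_iter=10_000):
    """Schweitzer-style approximate MVA for a closed multiclass network.

    demands[c][k] : service demand of class c at center k
    think[c]      : think time of class c
    population[c] : number of customers in class c (must be > 0)
    is_delay[k]   : True if center k is a delay (infinite server) center
    Returns (throughputs, queue_lengths).
    """
    C, K = len(demands), len(demands[0])
    is_delay = is_delay or [False] * K
    # Initial guess: each class spreads its customers evenly over the centers.
    q = [[population[c] / K for _ in range(K)] for c in range(C)]
    for _ in range(max_iter):
        total = [sum(q[c][k] for c in range(C)) for k in range(K)]
        new_q, x = [], []
        for c in range(C):
            n_c = population[c]
            resid = []
            for k in range(K):
                if is_delay[k]:
                    resid.append(demands[c][k])
                else:
                    # Queue seen on arrival: everything except one
                    # customer's share of the arriving class.
                    seen = total[k] - q[c][k] / n_c
                    resid.append(demands[c][k] * (1.0 + seen))
            x_c = n_c / (think[c] + sum(resid))
            x.append(x_c)
            new_q.append([x_c * r for r in resid])
        change = max(abs(new_q[c][k] - q[c][k])
                     for c in range(C) for k in range(K))
        q = new_q
        if change < tol:
            return x, q
    raise RuntimeError("approximate MVA did not converge")


# Hypothetical two-class, two-center example.
demo_demands = [[0.5, 1.0], [1.0, 0.5]]
demo_think = [5.0, 5.0]
demo_pop = [3, 3]
demo_x, demo_q = approximate_mva(demo_demands, demo_think, demo_pop)
```

Each sweep touches every class and center pair once, which is why the cost per iteration grows only with the product of the number of classes and centers rather than with the number of population vectors.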


In summary, the solution of the queueing net- 
work involves the following steps: 1) Obtain the rout- 
ing probabilities of closed class customers, arrival 
rates and departure rates of open class customer by 
setting the processor utilization P, to 1; 2) Solve the 
queueing network model of Figure 4 for each feasible
population to obtain two sets of load dependent service
rates; 3) Use the results obtained in step 2 to
solve the high level queueing model of Figure 5 to
obtain the performance metrics including processor
utilization P,; 4) Derive a new set of arrival and 
departure rates for open class customers by using the 
new value of P, and repeat this procedure from step 
2 until the required accuracy is reached. 
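
The four steps above amount to a fixed-point iteration on the processor utilization. A minimal sketch of that outer loop is given below; the three callables stand in for the open class rate computation, the low level solution (Figure 4) and the high level solution (Figure 5). They are hypothetical placeholders, and the toy lambdas at the end exist only to exercise the loop, not to model the system.

```python
def solve_by_iteration(open_rates, solve_low_level, solve_high_level,
                       tol=1e-4, max_iter=100):
    """Iterate steps 1-4 until the utilization changes by less than tol."""
    utilization = 1.0  # step 1: start with the utilization set to 1
    for _ in range(max_iter):
        rates = open_rates(utilization)                 # steps 1 and 4
        service_rates = solve_low_level(rates)          # step 2
        new_utilization = solve_high_level(service_rates, rates)  # step 3
        if abs(new_utilization - utilization) < tol:
            return new_utilization
        utilization = new_utilization
    raise RuntimeError("utilization did not converge")


# Toy stand-ins (purely illustrative): the loop then solves u = 1 / (1 + u/2).
toy_u = solve_by_iteration(
    open_rates=lambda u: 0.5 * u,
    solve_low_level=lambda rates: 1.0 + rates,
    solve_high_level=lambda service, rates: 1.0 / service,
)
```

Whether the real iteration converges, and how quickly, depends on the model itself; the sketch simply raises an error if the tolerance is not met within the iteration budget.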


Figure 5. The high level queueing network
representing the system.


4. Results and Discussions 


Solving the queueing network model defined in 
the previous sections, we can obtain a set of system 
performance metrics such as the mean queue length 
of each service center, the response time of a memory 
access, processor utilization and processing power. 
The processing power is the sum of the processor util- 
izations in the system which takes into account both 
the number of processing elements and the queueing 
effects of the cache protocol on the multiprocessor system.
Hence, we shall use processing power as the performance
metric in the following discussion. The
goal of our experiments here is to study the effects of
the number of buses and the degree of sharing on the
performance of cache based multiple-bus multiprocessor
systems for a given set of architectural and
workload parameters.


The architectural parameters used in this section
are chosen close to the values reported
in the literature. The size of each private cache is
assumed to be 2K blocks and the cycle time of a 
cache is the same as that of a processor. There are in 
total 32 shared memory blocks (N,, ) in the system. 
The memory operation takes 4 processor cycles. The 
transfer time of a packet on a bus is assumed to be 1 
cycle. This assumption is reasonable because most of 
the existing system buses have data bus width of 64 
or 128 lines [9]. As a result, a block of 4 or more 
words can be transferred in one bus transaction. The
fraction of reads ($f_r$) is 70% and of writes ($f_w$) 30%.
The probability of a private block being modified,
md, is 20%.


Figure 6 shows the processing power as a func- 
tion of mean processor thinking time for 16 processors 
and up to 4 buses. The thinking time is also the 
interrequest time that indicates the offered load of 
processors to the rest of the system. The memory 
request rate, which is the reciprocal of the thinking
time, represents the behavior of programs that are run
on the multiprocessor system. Obviously, CPU-bound
jobs have low request rates, generate
fewer memory requests, and consequently require
less communication overhead. As a result, the
performance difference between systems with
different numbers of buses is not significant at very
low request rates. However, as the request
rate increases, the performance gain of more buses
becomes significant. In particular, the processing
power is almost doubled by adding one more bus to
the single shared bus multiprocessor for request
rates above 0.4.


Figure 7 shows the system processing power as a
function of the number of processors for the same
numbers of buses. For a small number of processors,
the performance difference due to different number of 
buses is not significant, which indicates that a single 
bus does not create a severe system bottleneck. This 
result is consistent with that observed by Archibald 
and Baer [1] in their simulation studies. However, as 
the number of processors increases, the difference 
between the curves becomes large. In other words, 
the single bus gets saturated very quickly and 
degrades the system performance. For a large system 
size, the difference in behavior for different number of 
buses will be more pronounced. 


The effects of the degree of sharing on the system
performance are shown in Figure 8, where three
curves are illustrated for different values of $q_s$, the
probability that a given request is for a shared
memory block. As expected, a larger $q_s$ increases the
bus traffic needed to enforce the coherence protocol.
Moreover, each cache controller dedicates more time
to serving remote requests from the system buses. As
a result, a lower processor utilization is observed.


5. Conclusions 


As one type of high bandwidth interconnection
network for multiprocessors, the multiple-bus structure
has drawn considerable interest in the computer
architecture community. A great deal of work has
been done on the design and analysis of such structures.
However, the previous analyses of multiple-bus struc- 
tures did not consider the effect of cache coherence 
protocols. In this paper, we consider the packet- 
switched multiple-bus multiprocessors that use a 
modified write-once protocol to enforce cache coher- 
ence. A queueing network model that consists of 
mixed multiple class customers has been developed. 
The model captures the effects of both the multiple-bus
contention and the cache coherence protocol on
the system performance. To reduce the complexity of
solving the model, a simplified algorithm based on
the flow equivalence technique has been developed.


Figure 6. Processing power as a function
of request rate.


Figure 7. Processing power as a function
of the number of processors.


From the numerical results obtained from the 
model, we conclude that both the number of buses 
and the degree of sharing have significant effects on 
system performance. A high bandwidth network such 
as multiple-bus is necessary for a large system 
because the single bus gets saturated very rapidly and 
creates a system bottleneck. Although the snooping 
system and cache controller designs become complex 
with the increase in the number of buses, we believe
that the design of a 2- to 4-bus system is quite feasible.


Abstract


Software-assisted cache coherence enforcement schemes
for large multiprocessor systems with a shared global memory
and interconnection network have gained increasing attention.
Such schemes rely on software to decide which memory references
access potentially stale cache copies of variables. The
algorithms used usually overestimate the number of such
references. Few have pursued techniques to more accurately
identify accesses to stale cache copies. In this paper, we propose
an approach based on flow analysis to detect such
accesses. Software-based cache coherence schemes that can
utilize the detected results are discussed. We then present our
recently proposed scheme, which causes fewer unnecessary
invalidations and is faster than other proposed coherence schemes.


1. Introduction 


Properly managed private cache memory in multiprocessor
systems with shared global memory and interconnection
networks can decrease memory access time and reduce network
congestion and contention in the shared memory. However,
cache coherence has to be enforced before private caches
can be effectively used in such systems. The discussion will be
restricted to the problems of maintaining cache coherence in
large-scale multiprocessor systems with interconnection networks
and shared global memory (or simply large-scale
multiprocessor systems).


A cache coherence violation occurs when a cache copy of 
a processor does not have the up-to-date value of the vari- 
able. Such a copy is called a stale copy and an access to the 
copy a stale access. 


Most proposed cache coherence schemes rely entirely on
run-time mechanisms to maintain cache coherence. All
schemes in this category use some form of shared media, a bus
or directories, such that a processor can either monitor or be
notified of each modification of a variable by another processor
[Tang76, CeFe78, Good83, ArBa84, McCr84, PaPa84, RuSe84,
KaEW85]. Cache copies that become stale due to a
modification by other processors can then be updated or
invalidated, and cache coherence is maintained. However,
these schemes are not suitable for large-scale multiprocessor
systems since bus-based systems support only a limited
number of processors. On the other hand, directory
approaches are expensive to construct and cause global
memory accesses to be serialized to different degrees, or they
incur high communication cost.


Software-assisted approaches have been proposed and
present an alternative for cache coherence in large-scale
multiprocessor systems. In these approaches, potentially stale
cache copies of variables are purged at specific locations in
program execution. Copies known to be up-to-date, which
are usually kept in the shared global memory, are then
accessed [Smit85, BrMW85, EGKM85, McAu86, Veid86,
LeYL87, Lee87, ChVe87]. The high communication cost associated
with run-time detection of stale cache copies is avoided.


Efficient invalidation of stale copies is crucial to
software-assisted schemes. However, existing schemes also invalidate
non-stale cache copies and may result in low hit ratios. This
is the price paid for not relying on run-time detection of stale
cache copies. To limit the number of invalidations of
non-stale cache copies, alternatives to run-time detection of stale
accesses are important but have not been fully addressed in
software-assisted schemes.


The objectives of this work are (1) to develop a compile-time
flow analysis algorithm to detect stale accesses in parallel
programs, and (2) to discuss possible enforcement schemes
based on the detection results.


In the following sections, the formulation of the cache
coherence problem in terms of flow analysis is presented first.
Then, an algorithm for stale access detection is introduced. 
Possible cache coherence schemes based on the detection result 
are discussed. Finally, a brief discussion of the scheme that 
we recently proposed [ChVe88] is presented. 


2. The Cache Coherence Problem and A Flow Analysis Model


Efficient cache coherence maintenance depends on the 
stale access detection. In this section, the issues associated 
with compile-time detection of stale accesses and the proposed 
solutions are covered. 


2.1. Assumptions about Parallel Execution 


For clarity, our discussion will focus on parallel programs
with Doall-type [LuBa80] parallel loops (loops with no dependences
across iterations). Barrier synchronization is assumed
at parallel loop boundaries. Also, it is assumed that synchronization
operations are necessary to preserve the correctness of
the program; otherwise, a variable cannot be written and read
by distinct processors. All operations and memory accesses of
processors need to be completed before they can proceed
across a synchronization point. Synchronization variables are
not cacheable. It is also assumed that processor assignment to
a parallel loop is unknown at compile time. The following discussion
focuses on intra-procedural analysis and an empty
cache is assumed at the beginning of a procedure.


2.2. Conditions for Stale Accesses Detection 


Following the ideas in [Veid86], the occurrence of a stale
access to a variable X by a processor $P_i$ is determined by the
following sequence of events in execution: ($e_1$) a cache copy of
X is loaded in the cache of $P_i$, (synchronization operations
take place), ($e_2$) the latest write to X is executed by a
processor $P_j$, $j \neq i$, (synchronization operations take place),
and ($e_3$) processor $P_i$ reads X. The synchronization operations
in the sequence are implied under the earlier assumption.
Since our discussion is focused on parallel programs with Doall
loops, synchronization is done where a processor assignment
(PA) to loop iterations takes place. However, the approach is
general enough to deal with other cases.


Notice that the sequence of $e_1$ and $e_2$ constitutes the
sufficient condition for the existence of a stale copy of X in
$P_i$. A stale copy, unless accessed, does not cause
incorrect computation. However, since events $e_1$ through $e_2$ in
a sense determine whether processor $P_i$ in $e_3$ will access a
stale cache copy, let us call $e_1$ through $e_2$, collectively, the
determining sequence (DS) of the stale access to variable X by $P_i$.


Compile-time detection of stale accesses faces a major
obstacle. Namely, the details of processor assignment are unknown
at compile time. For example, it may not be known at
compile time whether the processors in $e_1$ and $e_2$ are the same. If
they are, the access in $e_3$ will not be stale. The identities of the
processors involved in the sequence are important. Without
this knowledge, a compile-time detection scheme cannot accurately
predict the occurrence of the DS.


The next best thing to detect is a set of sequences of
events which includes the DS. Such a set of sequences serves as
a necessary condition under which some processor may access
a stale copy of a variable X. The set is represented by the following
relaxed determining sequence (RDS) of variable X:
($E_1$) a cache copy of X is loaded into the cache of some processor,
($E_2$) one or more PA's occur, ($E_3$) a write to X is executed,
($E_4$) one or more PA's occur, and the event pair $E_3$
and $E_4$ may be repeated.


The DS of a stale access to a variable by a particular
processor is a specific case of the RDS of the stale access to the
variable. $E_1$ establishes the existence of an initial cache copy
of a variable. Each occurrence of $E_3$ represents a new value
assigned to the variable. The PA's separating the writes imply
that each such write may be executed by a different processor
and can turn existing cache copies stale. The terminating PA
of the RDS is necessary to determine that the latest write in
the RDS and the subsequent read after the RDS may be executed
by different processors. Hence, a read following these
events may use any one of the cache copies, which are all stale
except the one written last.


When an RDS of X precedes a read from X with no
write in between, the read from X is considered a stale access.
In some cases, an RDS of X, a write to X, and a read from
X occur in a sequence. If the execution of the read implies the
execution of the preceding write to X, the read from X is not
stale. This will be discussed in detail in later sections.


Notice that an RDS contains no read accesses except possibly
in $E_1$, since a read access does not modify the existing
value of a cache item and hence does not cause staleness. To
simplify detection, one needs to look only for write events.
The following sequence, however, will not be detected: a read,
PA, a write, PA, .... This is the case when a read is not preceded
by any write to the same variable in a subroutine. For
such a read, a dummy write to the variable is assumed in the
analysis. The dummy write accounts for the initial loading of
the cache copy due to the read.
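
The RDS condition can be illustrated on a single linear trace of events for one variable. The sketch below is a deliberately simplified, trace-based check and not the compile-time flow analysis developed in the following sections: it assumes a straight-line sequence of `"W"` (write), `"R"` (read) and `"PA"` (processor assignment) events, treats a write in the same interval as the read as preceding it on the same processor, and inserts the dummy write for a first read as described above. The function name and event encoding are assumptions of this sketch.

```python
def stale_reads(trace):
    """Return indices of reads preceded by an RDS with no write in between.

    trace: list of "W", "R" and "PA" events on one variable, in execution order.
    """
    dd_count = 0         # completed write-PA pairs seen so far
    pending_def = False  # a write (or dummy write) since the last PA
    seen_write = False   # any write or dummy write so far
    stale = []
    for idx, event in enumerate(trace):
        if event == "W":
            pending_def = True
            seen_write = True
        elif event == "PA":
            if pending_def:
                dd_count += 1       # the pending write becomes a write-PA pair
                pending_def = False
        elif event == "R":
            if not seen_write:
                # First access is a read: assume a dummy write that
                # accounts for the initial loading of the cache copy.
                pending_def = True
                seen_write = True
            elif dd_count >= 2 and not pending_def:
                stale.append(idx)
        else:
            raise ValueError(f"unknown event: {event!r}")
    return stale


example = ["R", "PA", "W", "PA", "R"]
flagged = stale_reads(example)
```

In the example, the first read loads a copy, a later write after a PA may be executed by a different processor, and the final read after another PA is flagged as potentially stale.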


2.3. Flow Analysis Formulation and the Flow Graph 


To detect an RDS at compile time, let us look for an execution
sequence of at least two ordered pairs of write-PA
events. Flow analysis is a good tool for such a
task.


Let us now formulate the detection in the flow analysis 
terms [AhSU86]. A write access to a variable is a definition 
(def) and a read access is a use. A def that is followed by PA’s 
is called a determining def (DD). Given a set of DD’s of a vari- 
able, each DD represents a potentially distinct cache copy. 
Thus, the detection of the RDS of a variable is nothing but
finding consecutive occurrences of DD's. (By consecutive
occurrences of DD's, the sequence of write-PA, write-PA,
..., is implied, in order to be distinguished from the sequence
of write, write, ..., PA. The latter does not form an RDS.)


The reaching def algorithm [AhSU86] can determine 
whether a value defined earlier in program execution can be 
used by a read reference. It will be applied to determine 
whether distinct cache copies, represented by DD’s, can be 
used by a read access. 


To find reaching defs, the concepts of kill, generate, and
reach are used. A def of variable X is generated by a
statement $S_i$ when the statement contains a write reference to
X. A def of X, $d_i$, in $S_i$ is killed by another def of X, $d_j$, in
$S_j$ if there is a path in the flow graph from $S_i$ to $S_j$. A def of
X, $d_i$, in $S_i$ is said to reach $S_k$ if there is a path from $S_i$ to $S_k$
and no other def of X kills $d_i$ on the path.


The goal of using the reaching def algorithm is to check if
DD's can reach a use without being killed, so that the use may
access the values associated with such DD's. A def, $d_j$, kills
another def, $d_i$, with respect to a subsequent use when the use
of the variable accesses only the value written by $d_j$.


The following two facts are important for detecting an
RDS and ultimately stale accesses. First, for cache management
purposes, it is assumed that a new def of an array variable
will result in generating new values, thus potentially
creating new cache copies. Since it is not known for sure if a
new assignment generates the same array elements as the previous
one, let us assume that array defs are not killed by subsequent
defs. Secondly, consider the flow graph. Let $d_i$ be
a def of a scalar X between two adjacent PA's, $PA_i$ and
$PA_{i+1}$. Let the execution of uses between $d_i$ and $PA_{i+1}$ always
be preceded by $d_i$. Then, these uses access only the value
written by $d_i$. However, the uses after $PA_{i+1}$ may access the
value written by $d_i$ or values written by defs prior to $PA_i$. Thus, the
scope of the kill by $d_i$ is limited only to uses between $d_i$ and
$PA_{i+1}$. For uses subsequent to $PA_{i+1}$, defs prior to $PA_i$ are
not killed by $d_i$. The defs prior to $PA_i$ can thus reach the
uses subsequent to $PA_{i+1}$. This is called reaching due to processor
assignment.


2.4. Modified Flow Graph 


A flow graph that models parallel execution is created by 
adding a cobegin and a coend node at the beginning and the 
end of each parallel loop. These nodes are the places of proces- 
sor assignments (PA’s). The nodes of a flow graph are basic 
blocks [AhSU86]. Each of the cobegin and coend nodes is con- 
sidered as a special type of basic block, namely, a processor 
assignment block (PAB). 


The reaching def algorithm applied to conventional flow
graphs cannot properly handle reaching due to processor
assignment. The flow graph is modified to correct this. An
edge serving as a bypass for defs prior to a $PA_i$ to reach the
uses after $PA_{i+1}$ is added between adjacent pairs of PA's. Since
scalars are not written in parallel loops, edges are only added
from the coend node of an outermost parallel loop to the cobegin
node of the outermost parallel loop(s) next in the flow
direction. These are called augmenting edges. The examples
of a flow graph and its modified version are given as $G_0$ and
$G'_0$, respectively, in Figure 1.


2.5. The Detection Algorithm 


Detection of stale accesses is performed as follows. First,
a flow analysis algorithm is applied to $G'_0$ to compute the DD's
reaching each block. Secondly, the existence of an RDS for
each read reference in the block is determined from the DD's
reaching the block. Finally, each read reference in a basic block
is checked to determine if it indeed accesses a potentially stale
cache copy.


Figure 1. $G_0$ and $G'_0$ of an example.


First, a gen set and a kill set are computed for each basic
block, B. The gen set, gen(B), contains the statements
which have a def of a variable that is not killed in B. The kill
set, kill(B), contains the set of statements, not in B, with defs
that will be killed by a def in B provided there is a path from
those statements to B. Each block is also associated with the
following sets: in(B), in'(B), out(B), and out'(B). in(B)
contains defs reaching B. in'(B) contains the DD’s reaching
B, namely, defs which reach a cobegin or coend node before
reaching B. Defs that reach the entry of B or are generated in
B are included in out(B) if they also reach the exit of B. If B
is not a PAB, out'(B) contains DD’s that reach B but are not
killed in B. If B is a PAB, all defs reaching B are turned into
DD’s and are included in out'(B). The sets are defined by the
following equations:


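$$\mathrm{out}(B) = \mathrm{gen}(B) \cup (\mathrm{in}(B) - \mathrm{kill}(B))$$

$$\mathrm{out}'(B) = \begin{cases} \mathrm{in}'(B) - \mathrm{kill}(B) & \text{if } B \text{ is not a PAB} \\ \mathrm{in}'(B) \cup \mathrm{in}(B) & \text{if } B \text{ is a PAB} \end{cases}$$

$$\mathrm{in}(B) = \bigcup_{\text{for all } i} \mathrm{out}(B_i), \qquad \mathrm{in}'(B) = \bigcup_{\text{for all } i} \mathrm{out}'(B_i),$$

where B_i is an immediate predecessor block of B.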


Next, an iterative algorithm is used to find in(B) and
in'(B) for each block B in G'_p. The algorithm converges when
in(B) and in'(B) of every block both stay unchanged for two
consecutive iterations. When the algorithm converges, in'(B)
contains all DD’s reaching B.
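The iteration can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function name reaching_defs, the dictionary-based encoding of blocks, predecessors, gen and kill sets, and the set pabs of processor assignment blocks are hypothetical and do not come from the paper. It stops when a full pass leaves every in(B) and in'(B) unchanged, which corresponds to the convergence test described above.

```python
def reaching_defs(blocks, preds, gen, kill, pabs):
    """Iterate the in/out equations to a fixed point.

    blocks: list of block names; preds: block -> list of immediate predecessors;
    gen, kill: block -> set of def names; pabs: set of PAB names.
    Returns (in_, in_dd): the defs and the DD's reaching each block.
    """
    in_ = {b: set() for b in blocks}
    in_dd = {b: set() for b in blocks}
    out = {b: set(gen[b]) for b in blocks}
    out_dd = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # in(B) and in'(B) are unions over the immediate predecessors.
            new_in = set().union(*(out[p] for p in preds[b]))
            new_in_dd = set().union(*(out_dd[p] for p in preds[b]))
            if new_in != in_[b] or new_in_dd != in_dd[b]:
                changed = True
            in_[b], in_dd[b] = new_in, new_in_dd
            out[b] = gen[b] | (new_in - kill[b])
            if b in pabs:
                # A PAB turns every def reaching it into a DD.
                out_dd[b] = new_in_dd | new_in
            else:
                out_dd[b] = new_in_dd - kill[b]
    return in_, in_dd
```

On a straight-line graph B0, PA1, B1, PA2, B2, B3 in which B0 and B2 each define the same scalar, the def in B0 becomes a DD once it passes PA1, and it is removed from out'(B2) because the def in B2 kills it.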


For convenience in our discussion, let in'(B) be partitioned
into subsets, each containing statements defining a
specific variable. For a variable X, let in'_X(B) represent the
subset of in'(B) such that

$$\mathrm{in}'(B) = \bigcup_{\text{for all } X} \mathrm{in}'_X(B),$$

where X is a variable that has a def reaching B. The results
of the algorithm on the example program in Figure 1 are given
in Table 1.


2.5.1. Finding an RDS from Determining Definitions 


The next step is to find out if the in’(B) to a block may 
result in an RDS. The conditions under which a pair of dis- 
tinct DD’s of the same variable reaching a block may result in 
an RDS are shown below: 


Lemma 1 


Two distinct determining defs of a variable, d_i and d_j,
reaching a block B_k, may result in an RDS to B_k if
d_i ∈ in'(B_j), where d_j ∈ B_j.


Table 1. Analysis results for the example in Figure 1.


Proof 


By the definition of in'(B), if d_i ∈ in'(B_j) and d_j ∈ B_j, a
path between the block containing d_i and B_j must contain a
cobegin or a coend node. Since d_j is a DD reaching B_k, there
is a cobegin or coend node on the path between B_j and B_k.
Then the following sequence is obtained: (E_1) d_i, (E_2) a PA,
(E_3) d_j, and (E_4) a PA.


A single determining def in the source program may also
result in an RDS to a block. This is the case when d_i is
enclosed by a serial loop or a backward branch that also
encloses a parallel loop. This case is covered by the following
lemma.


Lemma 2 


A determining def d_i, reaching B_k, results in an RDS to
B_k if d_i ∈ in'(B_i) and d_i ∈ B_i.


Proof 

By the definition of in'(B_i), if d_i ∈ in'(B_i) and d_i ∈ B_i, a
backward path to B_i must contain a cobegin or a coend node.
Since d_i is a DD reaching B_k, d_i must reach a cobegin or coend
node before reaching B_k. Thus, to block B_k, there exists the
RDS: (E_1) d_i, (E_2) a PA, (E_3) d_i, and (E_4) a PA.


The following theorem can be used to detect the 
existence of an RDS. 


Theorem 1 


The in'_X set of B_k results in an RDS to a read access to
X in B_k if and only if the following is true: (1) there exists
d_i ∈ in'_X(B_k) such that d_i ∈ in'_X(B_i) and d_i ∈ B_i, or (2)
there exists {d_i, d_j} ⊆ in'_X(B_k) such that d_i ∈ in'_X(B_j), where
d_j ∈ B_j.


Proof


The if part is straightforward by Lemma 1 and Lemma
2. The proof of the only-if part is as follows. Suppose that
neither (1) nor (2) is true. The case of an empty in'_X is trivial.
If in'_X(B_k) is not empty and (1) and (2) are both false, the
following must be true: blocks containing members of in'_X(B_k)
are not connected by paths containing cobegin or coend nodes,
since no member of in'_X(B_k) appears in the in'_X sets of other
blocks containing members of in'_X(B_k). Then, there is no PA
between the executions of the def(s). As a result, there can be
no RDS involving any def pair d_i and d_j or multiple d_i's. It
has just been shown that in'_X(B_k) cannot result in an RDS
if (1) and (2) are both false. So if in'_X(B_k) results in an RDS,
either (1) or (2) has to be true. Q.E.D.
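The two conditions of Theorem 1 translate directly into a membership test. The following is a minimal sketch, assuming the in'_X sets are already available; the name results_in_rds and the two dictionaries (in_dd_x mapping a block to its in'_X set, block_of mapping a def to the block containing it) are hypothetical.

```python
def results_in_rds(k, in_dd_x, block_of):
    """Theorem 1 test: does in'_X(B_k) result in an RDS?

    in_dd_x: block -> set of DD's of variable X reaching that block.
    block_of: def -> block that contains the def.
    """
    dds = in_dd_x[k]
    for di in dds:
        # Condition (1): d_i reaches its own block as a DD.
        if di in in_dd_x[block_of[di]]:
            return True
        # Condition (2): d_i reaches, as a DD, the block of another member d_j.
        for dj in dds:
            if dj != di and di in in_dd_x[block_of[dj]]:
                return True
    return False
```

The test is quadratic in the size of in'_X(B_k), which is acceptable for a sketch; the graph representation of Section 2.5.2 precomputes the same pair relation once per variable.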


2.5.2. An Implementation of RDS Detection 


The following is a possible implementation of RDS detection
from the in'_X set. For each variable X, the DD’s in the
union of in'_X sets of all blocks and the RDS’s formed by DD’s
are represented by an undirected graph M_{in'X}(V,E), in which
a node v_i ∈ V represents a DD and an edge e ∈ E, between
nodes v_i and v_j, denotes that the DD’s represented by v_i and v_j
together result in an RDS.

To determine, for block B_k of G'_p, if in'_X(B_k) of a
variable X results in an RDS of X, M_{in'X}(V,E) is examined. For
a def d_i ∈ in'_X(B_k), if there is an edge in M_{in'X}(V,E) between
the node representing d_i and a node representing another
def d_j ∈ in'_X(B_k), or d_i itself, the in'_X(B_k) set results in an
RDS to B_k.

M_{in'X}(V,E) can be represented by an adjacency matrix
M_{in'X}. The M_{in'X}(V,E) graphs and the M_{in'X} matrices for the
variables used in our example are shown in Figure 2 and Table
2. The in'_X(B_k) sets for different blocks containing uses of
variable X are shown as augmented columns to the matrix.


[Figure 2. M_{in'X}(V,E) of different variables.]


Table 2. M_{in'X} matrices for the example.


Take B, as an example. Its in'y set contains d, d, and 
d, (in the augmented column with tn'y and Bz, as headings). 
From the adjacency matrix, since entries (row d,, col d;) and 
(d,, d,) are 1, either one of these DD pairs in in'y(B,) forms 
an RDS. 
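The adjacency-matrix lookup can be illustrated with a short sketch. The function names build_rds_matrix and block_has_rds, the list defs that fixes the row and column order, and the dictionaries in_dd_x and block_of are assumptions made for this example; a 1 on the diagonal is used here to record the single-def case of Lemma 2.

```python
def build_rds_matrix(defs, in_dd_x, block_of):
    """Adjacency matrix over the DD's of one variable X.

    m[i][j] == 1 means defs[i] and defs[j] together result in an RDS.
    """
    n = len(defs)
    m = [[0] * n for _ in range(n)]
    for i, di in enumerate(defs):
        for j, dj in enumerate(defs):
            # d_i reaches the block of d_j as a DD (i == j covers Lemma 2).
            if di in in_dd_x[block_of[dj]]:
                m[i][j] = m[j][i] = 1  # undirected edge
    return m


def block_has_rds(k, defs, m, in_dd_x):
    """Check the rows and columns selected by the augmented column of B_k."""
    members = [i for i, d in enumerate(defs) if d in in_dd_x[k]]
    return any(m[i][j] for i in members for j in members)
```

The matrix is built once per variable, after which each block needs only a scan of the entries selected by its in'_X set.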


2.6. Finding Stale Uses in a Basic Block 


Now, the nature (stale or non-stale) of uses in a basic 
block can be determined. The uses of scalars and arrays are 
treated differently. 


For uses of scalars in a basic block, uses are divided into
upwardly-exposed uses and non-upwardly-exposed uses.
An upwardly-exposed use is a use in a block which is
preceded by no def of the same variable in the block. A
non-upwardly-exposed use is preceded by at least one def of
the same variable.


Theorem 2.1 


An upwardly-exposed use of a scalar variable X in a
block is considered a stale access if in'_X of the block results in
an RDS.


Proof 


If in'_X to the block results in an RDS in the system, the
def in E_3 in the RDS has turned the cache copy, written by the
def in E_1, into a stale copy prior to the execution of the block.
It is possible that the processor executing the block has the
stale copy of the variable in its cache.


Non-upwardly-exposed uses are not stale regardless of
whether the in'_X set of the variable results in an RDS, since the
preceding def in the same block always supplies the uses of the
scalar with an up-to-date copy.


Recall that a def of an array does not kill an earlier def.
Thus, a use of an array element may not access only the
values written by a preceding def in the same basic block. The
use has to be assumed stale. Therefore, for array variables,
no distinction between upwardly-exposed and
non-upwardly-exposed uses is made. The following theorem
is for detecting stale accesses to arrays, and the proof is similar
to the one for Theorem 2.1.


Theorem 2.2 


A use of an array variable X in a block is considered a
stale access if in'_X to the block results in an RDS.


In our example, the uses of X in B, and B, are non-stale 
accesses, and those in Bg, and B, are stale. The use of W in By 
and the use in Y in B, are stale. The use of S in By is non— 
stale. 
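Theorems 2.1 and 2.2 can be combined into one pass over a basic block. The sketch below is illustrative only: the function name stale_uses and the encoding of a block as a list of ("def" or "use", variable) pairs are assumptions, and the set rds_vars is taken to hold the variables whose in'_X set for the block results in an RDS.

```python
def stale_uses(statements, rds_vars, arrays):
    """Return the indices of the uses in one block that are marked stale.

    statements: list of ("def" | "use", variable) pairs in program order.
    rds_vars: variables whose in'_X set for this block results in an RDS.
    arrays: set of variables that are arrays.
    """
    defined = set()
    stale = []
    for idx, (kind, var) in enumerate(statements):
        if kind == "def":
            defined.add(var)
        elif var in rds_vars:
            upwardly_exposed = var not in defined
            # Scalars: only upwardly-exposed uses are stale (Theorem 2.1).
            # Arrays: every use is assumed stale (Theorem 2.2).
            if var in arrays or upwardly_exposed:
                stale.append(idx)
    return stale
```

A scalar use that follows a def of the same scalar in the block is skipped, while an array use is reported even when a def of the array precedes it, because subscripts are not examined at this stage.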


2.6.1. Optimization Using Data-Dependence Analysis


The algorithm in the above discussion does not use
subscript expressions when dealing with array variable accesses.
As a result, an array def is included in in'_X(B) even if
the elements assigned by the def are not used in B. This
happens when the subset of array elements used has no element in
common with the subset defined. A larger in'_X set to a block
is more likely to result in an RDS, hence the uses are more
likely to be regarded as stale accesses. Also, when two DD’s of an
array write two disjoint subsets of array elements, these DD’s
should not result in an RDS even if the conditions of Theorem
1 are met. If such a case can be detected, fewer uses will
be regarded as stale. Another result is Theorem 2.2, in which no distinction
between upwardly-exposed and non-upwardly-exposed uses of
arrays is made, regardless of whether the use will access only
up-to-date copies supplied by preceding defs in the same
block. Using data-dependence analysis, a more accurate
detection can be achieved.


Data dependence analysis can be used to refine the in'_X
for a use of array X or to detect that two DD’s in in'_X write
no array element in common. A reaching def can be deleted
from the in'_X for a use of X if the use does not depend on that
def of X. A use with a smaller in'_X has a better chance of
being a non-stale access. Since each use of array X may have
a different subscript expression than the other uses in a basic
block, the in'_X set has to be refined each time a use of X with
a different subscript expression is investigated.


One can also use the analysis of subscript expressions to
determine whether the uses are preceded by defs with the 
same subscript expression in the same block. Such uses will 
not be treated as stale accesses. This can be done by using 
direction vectors [Wolf82]. A use preceded by a def, with an 
all-equal direction vector in the same block, is not stale. 


3. Possible Coherence Schemes by Selective Invalidation


Software-directed cache coherence schemes can benefit
from the stale access detection. Possible approaches to
invalidate stale cache copies and related issues are discussed. Then,
a scheme that uses the detection results and a fast selective
invalidation approach is shown.


3.1. Invalidation Schemes 


There are several ways to make a reference access only
the up-to-date copy of a variable. It can be achieved either
by invalidating the stale cache copies, by loading the
up-to-date copy by forcing a cache miss (also considered as a form of
invalidation), or by updating all cache copies as soon as a
modification is done. We believe that the last approach is not
suitable for large-scale multiprocessor systems because it
necessarily (1) has high communication cost, (2) updates cache
copies which are not going to be used later, or (3) requires
more run-time bookkeeping. The following discussion will
concentrate on invalidation.


The goal is to achieve cache coherence without communi- 
cation to other processors. It can be accomplished by letting 
each processor manage its own cache through invalidation. To 
simplify the discussion, let us assume that the global memory 
always contains the up-to-date copies of the variables after 
each synchronization point or cache line replacement. 


Invalidation schemes in previous work suffer from a
common disadvantage, namely, over-invalidation. They all
invalidate at fixed program locations, such as where processor
assignments (PA’s) or synchronization operations occur. Some
methods invalidate all shared variables at each such location.
Others invalidate the entire cache (indiscriminate invalidation)
in favor of fast invalidation time. Cache variables that are
not stale are often invalidated [ChVe88].


Our discussion of invalidation schemes will focus on the
following issues: (1) when to invalidate, (2) what to invalidate,
and (3) how to invalidate. To address these issues, the
invalidation approaches are divided into two classes,
post-access invalidation and pre-access invalidation. In the first approach,
each processor invalidates the cache copies after its references
in order to prevent future stale accesses. In the second
approach, each processor invalidates before accessing what has
been detected as stale. Both approaches depend on the
knowledge that an access has been detected as stale. These
approaches are fundamental for software-directed invalidation.
The division applies equally to selective and indiscriminate
invalidations. However, selective invalidation is of primary
interest in this study. The indiscriminate approach has
been thoroughly discussed in [Veid86, ChVe87, Lee87].


3.1.1. Post-Access Invalidation


Post-access invalidation is a preventive coherence 
scheme in the sense that cache copies are invalidated before 
they can become stale due to a write by another processor. 
Cache copies created by read or write accesses need to be 
invalidated if they may turn stale before being accessed by 
future uses. The condition under which such cache copies are 
detected is similar to that for detecting an RDS and is given in 
the following: 


C1. A cache copy created by any def or use that reaches a
PA, a def, and a PA before reaching a use is considered a
stale copy that may be accessed by the use.


The post-access approach puts the responsibility to 
invalidate on the processor that creates such cache copies. Let 
us define the post-access approach as follows: 


In post-access invalidation, a processor invalidates the
cache copies that it created if they satisfy condition C1.


In the post-access approach, invalidations can be immediate or
delayed. In immediate invalidation, the cache copy is
invalidated right after each reference. Otherwise, the invalidation is
called delayed.


Immediate invalidation results in over-invalidation since
the number of invalidations of a variable can be as large as the
number of references. As a result, a cache item may be
invalidated more than once even though the one after the last
reference is sufficient. Temporal locality of multiple references of a
variable between a pair of adjacent PA’s is destroyed.


Using delayed invalidation, one can take advantage of
such temporal locality and can invalidate fewer items. Between
a pair of adjacent PA’s, delayed invalidation can rely on
standard optimizations to reduce or eliminate over-invalidation.
In the case of scalar variables, what should be invalidated by
delayed invalidation can be easily determined. Flow analysis
can show which scalar variable is referenced on a path between
a pair of adjacent PA’s. Invalidation of the cache copy can be
delayed until the processor is done with the references to the
variable. Delayed invalidation of array variables has to
depend on subscript analysis. However, in cases where
subscript analysis is inexact, immediate invalidation, in which the
exact subscript expression of an array reference is used to
determine the elements to be invalidated, can be used;
alternatively, the cache can be invalidated indiscriminately
before the subsequent PA.


While it is conceivable to delay invalidations beyond the
PA immediately following the accesses, the growing complexity
and inaccuracy of determining what to invalidate make
such a delay impractical. Consider the case in which
invalidations of the cache copies created before PA_i are postponed
until after PA_i but before PA_{i+1}. At compile-time, it may be
impossible, especially when the details of processor assignment
are not known, to determine exactly which elements of an
array the processor has referenced prior to PA_i. When
invalidation is delayed until after PA_i, a much larger number of
array elements have to be assumed present in the cache and
they will have to be invalidated.


Although such delayed invalidation is more difficult to
perform, it does have an advantage: it can preserve temporal
locality in the following case. Consider a sequence of
references of a variable in the following form: non-stale use, PA,
..., non-stale use, PA, def, PA, stale use. The cache copies of
the variable do not become stale until the def is executed. If
invalidation can be delayed until after the last non-stale use
in the sequence, processors accessing these cache copies before
the def can have cache hits. When the cache copies of the
non-stale uses are invalidated before each PA, potential hits
by the non-stale accesses separated by PA’s are lost.


3.1.2. Pre-Access Invalidation


As a dual to the preventive approach of post-access
invalidation, cache copies are allowed to turn stale and are left
alone. Stale cache copies are invalidated only when they may
be accessed. Pre-access invalidation is defined in the following:


In pre-access invalidation, a processor invalidates the 
cache copy before the detected stale access by the processor. 


Similar to the post-access approach, an invalidation is 
called immediate if it is done right before each stale access. 
Otherwise, it is called early. Since there can be several paths 
leading to a use, early invalidation has to make sure that the 
potentially stale cache copy is invalidated no matter which 
path is taken. 


As in the post-access approach, immediate invalidation
has the disadvantage of over-invalidation. Nor does it
exploit temporal locality between a pair of adjacent PA’s, since
newly loaded cache copies due to preceding uses or defs are
also invalidated. Early invalidation has the same advantages
as delayed invalidation in the post-access approach. Between
a pair of adjacent PA’s, early invalidation tries to determine
1) the first use, or 2) the first def (only for arrays) that is
always executed prior to the stale use of the same variable or
array element. If 1) is found, invalidation is needed only
before the first use. Invalidation is not needed if 2) is found.
For scalar variable accesses, flow analysis can handle the
above easily. For array variables, subscript analysis is needed.
When subscript analysis does not help, the worst case solution
is to use immediate invalidation, or indiscriminate invalidation
after the PA immediately preceding a stale use.


Invalidating earlier than the PA immediately preceding a 
stale use is difficult. The difficulties are similar to the ones in 
postponing invalidation beyond the PA immediately following 
an access in the post-access approach. A processor in pre- 
access early invalidation thus, in practice, invalidates the stale 
copy no earlier than the PA immediately preceding a stale 
access. Therefore, potential temporal locality of detected stale 
accesses across PA’s cannot be preserved. Such temporal 
locality exists in the following partial sequence of a variable: 
PA, stale use, PA, stale use, ..., PA, stale use. 


Pre-access invalidation has an advantage in that cache 
copies accessed by non-stale accesses are not invalidated. As a 
result, temporal locality of accessing non-stale cache copies 
can be extended beyond a pair of adjacent PA’s. 


3.1.3. Selective Invalidation 


Invalidation is usually accomplished by invalidate
instructions. Since invalidate instructions consume processor
cycles, it is essential to keep their number to a minimum. Early
and delayed invalidations are better than immediate invalidation since
multiple invalidations of the same cache item can be avoided.
However, even though the early and the delayed approaches
can reduce the number of invalidate instructions, selectively
invalidating cache items by executing invalidate instructions
is a sequential process, and it increases the execution time.
Also, in the worst case, immediate invalidation has to be used
and the number of invalidate instructions executed becomes
even larger.


Invalidate instructions can be replaced by reference
marking. Stale accesses are detected and marked. Provided
the processor can issue a different kind of read access for
references marked stale, the cache controller will load the
up-to-date copies from the global memory at access time upon
such read requests. This is essentially a variation of
pre-access immediate invalidation (this does not apply to
post-access invalidation). However, it saves processor cycles since
no explicit invalidate instructions are required.


3.2. Fast Selective Invalidation 


This approach was proposed in [ChVe88]. It is a
pre-access approach with hardware assist. Invalidate instructions,
which increase execution time, are replaced by reference
marking. In addition, status bits in the cache memory help to
take advantage of temporal locality and the reduced
over-invalidation associated only with early invalidation.
Furthermore, this approach is not affected by over-invalidation as in
cases when subscript analysis does not help pre-access early
invalidation.


The fast selective invalidation scheme works as follows.
References detected as stale accesses are marked
memory-read and the non-stale accesses are marked
cache-read. Each cache word has a change bit and a clear
bit. It is assumed that the processor sets the change bits of the
whole cache in one clock upon executing a change-cache
instruction. The clear bit can be similarly set by a
clear-cache instruction. The change-cache instruction is
inserted after each PA. The cache controller differentiates a
memory-read from a cache-read. Whenever a cache-read
is executed, the cache controller will report a hit if the
referenced item is in the cache. Whenever a memory-read is
executed, the cache controller handles the access in two ways
when a cache copy exists:


(1) If the change bit of the referenced word is set, the con- 
troller reports a miss. The word is loaded from the global 
memory and the change bit is reset. 

(2) If the change bit is not set and the tags are matched, the 
controller reports a hit. 


A write to a cache word always resets the corresponding 
change bit. 
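The behavior of the change bit can be illustrated with a toy model of one processor's cache. This is a sketch under stated assumptions rather than the hardware of [ChVe88]: the class name Cache, its method names, the word-addressed dictionaries, and the write-through update of global memory are all hypothetical simplifications, and the clear bit is omitted.

```python
class Cache:
    """Toy single-processor cache with one change bit per cached word."""

    def __init__(self, memory):
        self.memory = memory   # shared global memory: address -> value
        self.data = {}         # cached words: address -> value
        self.change = {}       # change bit of each cached word

    def change_cache(self):
        """Set every change bit (the change-cache instruction after a PA)."""
        for addr in self.change:
            self.change[addr] = True

    def _load(self, addr):
        self.data[addr] = self.memory[addr]
        self.change[addr] = False

    def cache_read(self, addr):
        """Read marked non-stale: a hit whenever the word is cached."""
        hit = addr in self.data
        if not hit:
            self._load(addr)
        return self.data[addr], hit

    def memory_read(self, addr):
        """Read marked stale: a miss if the word is absent or its bit is set."""
        hit = addr in self.data and not self.change[addr]
        if not hit:
            self._load(addr)
        return self.data[addr], hit

    def write(self, addr, value):
        self.data[addr] = value
        self.change[addr] = False  # a write always resets the change bit
        self.memory[addr] = value  # assumption: write-through to global memory
```

After a change_cache, the first memory_read of a word reloads it from global memory and resets its change bit, so a second memory_read of the same word before the next PA hits in the cache.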


One advantage of this scheme is that it is faster than
invalidating page table entries or cache items one-by-one as in
previously proposed schemes [Smit85, McAu86]. Also, fewer
non-stale cache copies than in any previously proposed scheme
are invalidated since both accesses to read-only variables and
non-stale accesses to read-write shared variables require no
invalidation of the cache copies.


The change bit helps preserve temporal locality and
helps reduce over-invalidation between a pair of adjacent
PA’s. In so doing, no subscript analysis is needed as in
pre-access early or post-access delayed approaches. When a stale
access is preceded, in execution, by another stale use or a def
of the same element, the change bit is reset due to the earlier
access. Even though the access is marked memory-read, the reset
change bit indicates that a fresh copy has been loaded and
prevents it from being invalidated again by another load from
the shared global memory.


A restricted version of stale access detection has been
implemented. Fortran programs parallelized by Parafrase, a
Fortran preprocessor [KKLW80], are used as inputs. The
simulated performance of the fast selective invalidation
scheme was evaluated and the results were presented in
[ChVe88].


4. Conclusion 


In this paper, a flow analysis algorithm for detecting
stale accesses is proposed. Even though the detection of stale
accesses is an important step in software-assisted cache
coherence schemes in multiprocessor systems, it has not been fully
addressed in previous work. A modified flow graph for
modeling parallel execution and its effects on cache coherence is
introduced such that the standard flow analysis techniques can
be applied to detect stale accesses.


Possible approaches to coherence enforcement using the 
result of the detection scheme are discussed. The advantages 
and disadvantages of such approaches are described. A 
recently proposed scheme that relies on both hardware and 
software to enforce cache coherence is shown. The new 
approach achieves better selective invalidation and does it fas- 
ter than previously proposed schemes. 


The detection algorithm proposed can also be extended
to manage other types of memory systems, such as local
memory and multilevel cache systems in multiprocessor
systems. Also, even though only one type of parallelism, namely
Doalls, has been considered throughout this paper, the results can
be extended to other parallel loop types and other types of
parallelism.


ABSTRACT


In many shared-memory multiprocessors, private caches are 
associated with each processor and coherence among caches is 
maintained in hardware by a cache coherence protocol on the 
memory bus. Multithreading, or the concurrent execution of 
the multiple processes forming a task is also often supported in 
these systems. The efficiency of multiprocessor systems for a 
parallel algorithm depends to a large extent on the amount of 
sharing in the algorithm and on the effectiveness of the cache 
protocol for shared data accesses. Even if the cache sizes were 
infinite, the number of processors which can be connected to a 
bus would still be limited by the bus traffic due to the initial 
loading of data and instructions in each cache, and to the active 
sharing of writable data. 

In this paper, we analyze shared data contention in parallel 
algorithms and its effects on the performance of a cache coher- 
ence protocol under the assumption of infinite cache sizes. A 
simple program model for data sharing is introduced and an an- 
alytical closed-form solution is found for all components of the 
cache coherence overhead. We then study the overhead due to 
shared data contention in five parallel algorithms: the iterative 
Jacobi algorithm, the iterative S.O.R. algorithm, the parallel 
quicksort, and the shuffling and the non-shuffling F.F.T. Fi- 
nally, these overheads are compared with the predictions of the 
analytical model. 


1. INTRODUCTION 


In modern multiprocessors, a given algorithm may be de- 
composed into cooperating processes that run in parallel [2]; 
this technique is called multitasking or multithreading. In a 
shared-memory multiprocessor processes working in parallel on 
the same algorithm cooperate through the sharing of data in 
memory. Usually, in a shared-memory multiprocessor a cache 
[16] is associated with each processor, in order to reduce both 
memory access latency and memory-bus traffic. Shared data 
may be cached provided a hardware protocol maintains con- 
sistency among multiple copies of the same data in different 
caches. By caching shared data one hopes to increase the av- 
erage hit ratio and to reduce the bus traffic. 

Three techniques are possible to analyze the performance
of cache-based multiprocessors: measurements on an existing
system, trace-driven simulations [13], and simulations or
analytical models based on a program behavior model. These three
techniques vary in cost, flexibility and accuracy. If an
analytical model is shown valid for a number of significant systems and
algorithms, then it can be used to predict their performance at
reduced cost.

Early work on the analytical evaluation of cache-based
systems was done by Patel [14] and Briggs and Dubois [4]; these
papers either ignored the cache coherence effect or assumed
that shared data are not cached. In order to include the effect
of data coherence in the models, several authors have taken
the approach of modeling the workload with an analytical
program model [17]; the workload model is then used in a
simulation or in an analytical model. In [7], Dubois and Briggs
introduced a model for multiprocessor program behavior and
derived a closed-form solution in the case of a multiprocessor
system with finite caches and LRU (Least Recently Used)
replacement policy. Archibald and Baer [3] published simulation
results comparing various cache protocols. In [18] Vernon and
Holliday introduced a timed Petri net model driven by the same
program model as in [7]; no closed-form solution was proposed.
In this paper, we introduce a new program behavior model
based on the observation that shared data are modified in
critical sections; a closed-form solution is derived for the average
miss ratio and penalty. Finally, the model is applied to several
algorithms and the effect of cache block size is investigated.


2. INFINITE CACHE MODEL AND GENERAL AS- 
SUMPTIONS 


Several simplifying assumptions are made throughout this 
paper. We first list them, then we discuss their validity. 


Assumption 1: The size of all caches is infinite. 
Assumption 2: The models are in steady-state. Initial tran- 
sients are not included. 

Assumption 3: Process preemption and migration are disal- 
lowed; i.e., a process executes from start to finish on the same 
processor and without interruption. 


The major motivation for studying the infinite cache model
is its simplicity. Most parameters of the cache, such as cache
size, cache organization or cache replacement policy, do not
affect the model prediction; the resulting models are therefore
parsimonious. Moreover, present trends in memory chip sizes
indicate that fast and large caches are becoming possible. In
these caches, most of the misses are due to the initial loading
of the data, and to coherence invalidations. It is expected that
the infinite cache model will become more and more relevant
as the level of integration of static RAM chips increases.

Modeling transient effects in an infinite cache is not diffi- 
cult, but the models are not very interesting. For example, at 
program start, caches are empty; every block referenced in a 
parallel algorithm must first be brought into one of the caches; 
this initial miss is not counted in the models. The number of 
these initial misses is simply equal to the total number of differ- 
ent blocks accessed during the whole execution of the parallel 
algorithm. 

The third assumption may be the most restrictive. It is 
realistic in systems supporting group scheduling [11]. Under 
the group scheduling strategy, all processes participating in a 
task are scheduled and preempted together. We will reconsider 
this assumption in Section 7. 


3. CACHE COHERENCE PROTOCOL 


The cache coherence protocol considered in this paper is an
invalidation-based protocol described, for example, in [10] (page
522); other protocols have been designed [3], and the techniques
described in this paper could be applied to any one of them.

Generally, in a coherence protocol, multiple copies of the
same cache block may be present in different caches, provided
the copies are Read-Only (RO copies), that is, provided no
processor has modified any word in the block. If a processor
needs to modify a word in a block, it must obtain a Read-Write
copy (RW copy), i.e. a unique copy of the block: this may
involve invalidating copies of the block in other caches. Usually,
a block containing only instructions, private data or shared
constants will be tagged as RO, while blocks containing shared
writable data may be tagged as RW. Therefore, we distinguish
between S-blocks and P-blocks. An S-block contains at least
one shared writable data item while a P-block contains only
instructions, private data or shared Read-Only data. In the
protocol selected for this study, the following cache events on
an S-block may occur in a multiprocessor system with infinite
caches and in steady-state (refer to Figure 1):


1. Miss: this event occurs when the data is referenced and 
is not present in the cache. We denote this event as M 
(Miss). All misses occurring as a result of the following 
events are accounted for as M events. 


2. Transition from RO to RW: this event occurs when a 
processor needs to modify a block already present in another 
cache as RO; a miss may occur and an invalidation 
must be sent to the processor(s) possessing RO copy(ies); 
we denote this event as IN_RO (INvalidation of RO 
copy(ies)); 


3. Transition from RW to RO: this event occurs when a 
processor reads a block present in another cache as RW; 
besides the occurrence of a miss, a signal must be sent 
to the cache possessing the RW copy and this cache must 
update the main memory; we denote this event as CS_RW 
(Change State of a RW copy). 


4. Transition from RW to RW in a different cache: this event 
occurs when a cache needs to modify a block which is 
owned as RW by another cache; it implies a miss, an 
invalidation, and the update of main memory; we denote 
this event as IN_RW (INvalidation of a RW copy). 


When writable blocks are actively shared, copies must be trans- 
ferred among caches and invalidation signals must be sent. As 
the number of processors actively sharing a block increases, the 
invalidation activity usually increases. 
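
The four events can be made concrete with a short replay of a reference string for a single block. The sketch below is only an illustration under stated assumptions: the function name and the trace format are hypothetical, caches are infinite, an IN_RO event is recorded only when some other cache actually holds a copy, and cold misses are counted (the steady-state models ignore them).

```python
from collections import Counter

def count_coherence_events(trace):
    """Replay (processor, 'R' | 'W') accesses to ONE block; infinite caches.

    Unlike the steady-state models, this sketch also counts cold misses.
    """
    holders = set()   # caches currently holding a valid copy
    mode = None       # None (block only in memory), 'RO' or 'RW'
    events = Counter()
    for proc, op in trace:
        hit = proc in holders
        if not hit:
            events["M"] += 1
        if op == "R":
            if not hit and mode == "RW":
                events["CS_RW"] += 1      # the owner's RW copy becomes RO
                mode = "RO"
            elif mode is None:
                mode = "RO"
            holders.add(proc)
        else:  # write access
            if mode == "RW" and not hit:
                events["IN_RW"] += 1      # RW copy in another cache invalidated
            elif mode == "RO" and holders - {proc}:
                events["IN_RO"] += 1      # RO copies in other caches invalidated
            holders = {proc}
            mode = "RW"
    return events
```

For instance, the trace [(0,'R'), (1,'R'), (0,'W'), (1,'R'), (1,'W'), (0,'W')] yields four M events, two IN_RO events, one CS_RW event and one IN_RW event.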


4. ANALYTICAL MODELS 


The access pattern to shared data in multiprocessor systems 
depends on the algorithm. We can distinguish between two 
broad classes of shared variables: synchronization data (such 
as locks) and other shared operands. Synchronization data 
are used to coordinate process execution or to protect shared 
operand accesses. 

Kung [12] classifies multitasked algorithms into synchronized 
and asynchronous algorithms. In asynchronous algorithms, 
accesses to shared operands are not protected and each 
processor may access the data as it needs them. In synchronized 
algorithms, accesses to shared data are restricted, either 
by explicit synchronization or simply by structuring the forking 
and joining of processes. In synchronized algorithms, shared 
writable data are accessed either in critical sections [2] (only 
one process can access the data at a time, either to Read or to 
Write) or in semi-critical sections [5] (multiple processors 
can read a data item at a time, but if a process has to modify 
the data item, it must do so in mutual exclusion). Figures 2(a) 
and 2(b) illustrate both access patterns. In these Figures, 
only accesses to a specific shared datum are shown. 


4.1 Analytical Model for one S-block 


The program model is derived from the model in [17]. We 
had to extend this model because it did not capture the locality 
of references to shared writable data. In another paper [6], we 
presented two additional program models for which the effect 
of cache coherence can be solved analytically and which take 
into account the accesses made in critical sections and semi- 
critical sections. All these program models can be defined as a 
special case of the following model. The program model that we 
are about to define assumes that accesses by one processor to a 
shared writable block are done in uninterrupted bursts. Besides 
modeling critical section accesses, the burst model takes into 
account the locality of reference on shared blocks. We use the 
same notation as in [7]. 

The P processors execute independent streams of instructions 
and generate homogeneous streams of references. S-blocks 
belong to different sets; all S-blocks in a set are accessed with 
the same pattern, even if they are accessed by different processors; 
the program model, model parameters and coherence 
overheads are identical for all the S-blocks in a set. 

Let q_s be the fraction of references to S-blocks. The fraction 
of S-block accesses that are for a particular S-block i is p_i, with 
i = 1,...,N_s, where N_s is the total number of shared writable blocks. 
S-block i is shared by J_i processors (J_i ≤ P, the total number 
of processors in the system). Processors access an S-block i in 
bursts. l_i is the average burst size, i.e. the average number 
of accesses to the block during an access burst. An isolated 
access is counted as a burst of size one. The average burst 
size can be found by dividing the total number of references by 
the number of access bursts. For example, for the examples of 
Figure 2 the average burst sizes are l_i = 2 (Figure 2(a)) and 
l_i = 1.75 (Figure 2(b)), assuming that one cache block contains 
only one data element. 

The fraction of processor references which start an access 
burst for a given S-block i is q_s p_i / l_i. The basic approximation 
of the model is that access bursts are independent from one 
another. When a processor completes an access burst, all the J_i 
processors have the same probability of starting the next access 
burst to the shared block. We designate by W_i the probability 
that the block is modified during an access burst. Because of 
the infinite cache assumption, there is no interference among 
cache accesses to different blocks and the events occurring for 
one block are independent of the events occurring for any other 
block; the state transitions of S-block i can be observed in 
isolation. 

The global state of an S-block i is described by the number 
of caches possessing a copy of the block and by the status 
RO or RW of the block. The global states are denoted by 
1_RW, 1_RO, 2_RO,..., J_i_RO. We can ignore the identity of 
specific processors because the multiprocessor is homogeneous 
and symmetric. 

The Markov chain for the state transitions of S-block i is 
shown in Figure 3 (we have dropped the index i in the Figure 
for clarity. Note that all parameters are for a given S-block 
i). The state of the Markov chain is the global state of the 
block whenever an access burst is completed (except for the 
state MEM, which is the state of the block before the first 
reference to it). It is clear from Figure 3(a) that states MEM 
and 1_RO are transient states. Figure 3(b) shows the reduced 
Markov chain where the transient states have been removed. 
We will only solve the Markov chain of Figure 3(b). A state 
transition occurs in this state diagram every time a burst of 
accesses is completed by one processor. The transition probabilities 
from state k_RO, k < J_i, are found as follows. 


1. From state k_RO to state (k+1)_RO: The probability of 
this transition is the product of the probability that 
the next burst contains only Read accesses, (1 - W_i), and 
of the probability that the access burst is made in one of 
the J_i - k other caches, (J_i - k)/J_i. 


2. From state k_RO to state k_RO: This is the case when 
the next access burst contains only Read accesses to one 
of the k caches. The transition probability is (1 - W_i) k/J_i. 


3. From state k_RO to state 1_RW: This is the case when 
the next access burst modifies the block. The transition 
probability is W_i. 


The transition probabilities from states 1_RW and J_i_RO are 
derived from similar arguments. 

This finite state Markov chain is aperiodic and irreducible. 
Let us denote by Pr(1) and by Pr(k), k = 2,...,J_i, the state 
probabilities of state 1_RW and states k_RO respectively. The state 
probability distribution is given by the following set of equations 
(see for example [1]): 

\[ \mathrm{Pr}(k) = \frac{(J_i - k + 1)(1 - W_i)}{(J_i - k)(1 - W_i) + J_i W_i} \, \mathrm{Pr}(k-1), \quad k = 2, \ldots, J_i \tag{1} \]

and

\[ \mathrm{Pr}(1) = \frac{J_i W_i}{J_i - 1 + W_i} \tag{2} \]
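
The state probabilities can be checked against the transition probabilities listed above. The following is a minimal sketch, not the authors' code: the function names are ours, exact rational arithmetic is used to avoid rounding, and the expression used for Pr(1), J W / (J - 1 + W), is our reading of equation (2) obtained from the balance equation of state 1_RW.

```python
from fractions import Fraction

def state_probabilities(J, W):
    """Return [Pr(1), ..., Pr(J)]: entry 0 is state 1_RW, entry k-1 is k_RO.

    Requires J >= 2 and 0 < W <= 1.
    """
    W = Fraction(W)
    pr = [J * W / (J - 1 + W)]                        # equation (2), as read here
    for k in range(2, J + 1):                         # recurrence of equation (1)
        pr.append(pr[-1] * (J - k + 1) * (1 - W)
                  / ((J - k) * (1 - W) + J * W))
    return pr

def transition_matrix(J, W):
    """One-burst transition probabilities between the same J states."""
    W = Fraction(W)
    T = [[Fraction(0)] * J for _ in range(J)]
    T[0][0] = W + (1 - W) * Fraction(1, J)            # any write, or owner reads
    T[0][1] = (1 - W) * Fraction(J - 1, J)            # another cache only reads
    for k in range(2, J + 1):
        T[k - 1][0] += W                              # any Write burst
        T[k - 1][k - 1] += (1 - W) * Fraction(k, J)   # Read burst by a holder
        if k < J:
            T[k - 1][k] += (1 - W) * Fraction(J - k, J)  # Read burst elsewhere
    return T

def is_stationary(pr, T):
    """True if pr is left unchanged by one step of the chain."""
    n = len(pr)
    return all(sum(pr[i] * T[i][j] for i in range(n)) == pr[j]
               for j in range(n))
```

With J = 2 and W = 1/2, for example, the sketch gives Pr(1) = 2/3 and Pr(2) = 1/3.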

With these state probabilities, one can compute the probability 
of occurrence of each coherence event. When there are k 
copies in k processor caches, a miss occurs at the beginning of 
a new access burst, i.e. at a state transition in Figure 3(b), if 
the next processor to start an access burst is one of the (J_i - k) 
processors without a copy in their cache. Therefore, the fraction 
of references to S-block i which miss in the cache is equal 
to the fraction of state transitions causing a miss divided by 
the average burst length l_i: 


\[ M_i = \frac{1}{l_i} \sum_{k=1}^{J_i - 1} \mathrm{Pr}(k) \, \frac{J_i - k}{J_i} \tag{3} \]


After some transformations, one finds simply (see [9]): 

\[ M_i = \frac{1}{l_i} \cdot \frac{(J_i - 1) W_i}{1 + (J_i - 1) W_i} \tag{4} \]
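
As an independent sanity check of the closed form (4), the burst model itself can be simulated. This is only a sketch under the assumptions of the model (each of the J sharers equally likely to start the next burst, a burst writes with probability W); the function names, the seed and the number of simulated bursts are arbitrary choices of ours.

```python
import random

def simulated_miss_fraction(J, W, bursts=200_000, seed=1):
    """Fraction of access bursts that start with a miss in the burst model."""
    rng = random.Random(seed)
    holders = {0}          # arbitrary starting state; its effect fades quickly
    misses = 0
    for _ in range(bursts):
        proc = rng.randrange(J)            # every sharer is equally likely
        if proc not in holders:
            misses += 1
        if rng.random() < W:               # Write burst: a unique RW copy remains
            holders = {proc}
        else:                              # Read-only burst: one more RO copy
            holders.add(proc)
    return misses / bursts

def closed_form_miss_fraction(J, W, l=1.0):
    """Equation (4): misses per reference for bursts of average size l."""
    return (J - 1) * W / (l * (1 + (J - 1) * W))
```

With l = 1 the two functions measure the same quantity (misses per burst); for J = 4 and W = 0.5 the closed form gives 0.6, and the simulated value should fall close to it.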

In the Markov graph of Figure 3(b), a transition from state 
1_RW to state 1_RW results from three possible sequences of 
events: 


1. the processor owning the RW copy of the block has 
started a new burst of accesses for the same block; no 
event is recorded for S-block i in this case; 


2. a different processor has started an access burst for 
S-block i and its first access to the block is a Write; an 
event of type IN_RW must be recorded for S-block i; 


3. a different processor has started an access burst for 
S-block i and its first access to the block is a Read followed 
by a Write; one event of type CS_RW followed by one 
event of type IN_RO must be recorded. 


In order to differentiate between the 2nd and the 3rd cases, 
we have to introduce a new factor f_i, which is the fraction 
of Write bursts¹ such that the first access is a Write. f_i can 
easily be computed from a string of references. For example, 
in Figure 2(a), f_i = .75, and, in Figure 2(b), f_i = .5. 
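
The parameters l, W and f can be extracted from a string of references to one block as sketched below. The trace format and the function name are hypothetical, and the sketch makes one simplifying assumption: a burst is taken to be a maximal run of consecutive accesses by the same processor, so two back-to-back bursts of one processor are merged. The trace must contain at least one Write burst.

```python
from fractions import Fraction
from itertools import groupby

def burst_parameters(trace):
    """trace: list of (processor, 'R' | 'W') accesses to one S-block, in order.

    Returns (l, W, f) as exact fractions.
    """
    # A burst is a maximal run of consecutive accesses by the same processor.
    bursts = ["".join(op for _, op in run)
              for _, run in groupby(trace, key=lambda access: access[0])]
    write_bursts = [b for b in bursts if "W" in b]
    l = Fraction(len(trace), len(bursts))            # average burst size
    W = Fraction(len(write_bursts), len(bursts))     # fraction of Write bursts
    f = Fraction(sum(b[0] == "W" for b in write_bursts), len(write_bursts))
    return l, W, f
```

For the made-up trace [(0,'R'), (0,'W'), (1,'W'), (1,'R'), (2,'R'), (0,'R'), (0,'R'), (0,'W')] there are four bursts (RW, WR, R, RRW), so l = 2, W = 3/4 and f = 1/3.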

An invalidation of RO copies occurs whenever an access 
burst modifies a block in an RO state. It also occurs in a transition 
from 1_RW to 1_RW, provided the second access burst 
is executed by a different processor and starts with a Read. 
Therefore, the fraction of accesses to S-block i invalidating RO 
copies in other caches is given by: 

\[ I_i = \frac{1}{l_i} \left[ W_i \sum_{k=2}^{J_i} \mathrm{Pr}(k) + \mathrm{Pr}(1) \, \frac{J_i - 1}{J_i} \, W_i (1 - f_i) \right] \]


¹ By definition, a Write burst is an access burst containing at least one 
Write access. 


A change of state from RW to RO occurs whenever a burst 
leaving the block in state 1_RW is followed by a burst starting 
with a Read access by any other processor. Therefore, the 
fraction of references to S-block i changing the state from RW 
to RO is: 

\[ C_i = \frac{1}{l_i} \left[ \mathrm{Pr}(1) \, \frac{J_i - 1}{J_i} (1 - W_i) + \mathrm{Pr}(1) \, \frac{J_i - 1}{J_i} \, W_i (1 - f_i) \right] \]

Finally, an invalidation of a RW copy occurs whenever an access 
burst leaving the block in state 1_RW is followed by a burst 
starting with a Write from any other processor. The fraction of 
references to S-block i causing such an event is therefore: 

\[ D_i = \frac{1}{l_i} \, \mathrm{Pr}(1) \, \frac{J_i - 1}{J_i} \, W_i f_i \]

In these equations, Pr(1) is given by equation (2). 
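
The four per-reference event fractions can be evaluated together. The sketch below encodes our reading of the three event descriptions above (the function name is ours, and the formulas for I, C and D are reconstructed from the prose, so they should be treated as assumptions rather than as the authors' exact expressions).

```python
from fractions import Fraction

def event_fractions(J, W, f, l):
    """Per-reference fractions (M, I, C, D) for one S-block.

    J: sharers, W: probability that a burst writes, f: fraction of Write
    bursts starting with a Write, l: average burst size. Requires 0 < W <= 1.
    """
    W, f, l = Fraction(W), Fraction(f), Fraction(l)
    pr1 = J * W / (J - 1 + W)              # Pr(1), state 1_RW
    other = Fraction(J - 1, J)             # next burst comes from another cache
    M = (J - 1) * W / (l * (1 + (J - 1) * W))               # misses, eq. (4)
    I = (W * (1 - pr1) + pr1 * other * W * (1 - f)) / l     # IN_RO events
    C = pr1 * other * ((1 - W) + W * (1 - f)) / l           # CS_RW events
    D = pr1 * other * W * f / l                             # IN_RW events
    return M, I, C, D
```

For J = 2, W = 1/2, f = 1/2 and l = 2 this gives M = 1/6, I = 1/8, C = 1/8 and D = 1/24. As a consistency check, C + D equals the per-reference probability that a burst following state 1_RW is started by another processor.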


4.2 System Effects 


The results of the previous section are combined to model 
the effect of cache coherence on the overall system performance 
under the assumptions of Section 2. Two performance mea- 
sures are derived: the miss ratio and the average coherence 
penalty. 


4.2.1 Miss ratio 


In the infinite cache model, if we neglect the transients, the 
miss rate is given by the miss rate on shared writable data, i.e., 

\[ M = q_s \sum_{i=1}^{N_s} p_i M_i \tag{5} \]

where N_s is the total number of shared writable blocks. The 
value for M_i is obtained by applying equation (4) and depends 
on the values of the parameters for each S-block i. In 
many cases, the terms in the above sum can be clustered by 
grouping the shared writable blocks into sets; within a set all 
blocks are referenced with the same pattern, and therefore have 
the same value of M_i. 


To find M from equation (5), one needs to specify the model 
parameters for all sets of blocks. In the studies presented in 
[7,3,18], there is only one set of parameters. Implicitly, it is 
assumed that the models can be applied to a single average set 
including all the shared writable data. Parameters are therefore 
computed as averages. For the five examples presented in 
Sections 5 and 6, this approach has proven to be acceptable. 


4.2.2 Average Coherence Penalty 


A processor runs at maximum speed when no cache misses 
or coherence events occur. To each coherence event corresponds 
an average penalty, A_event. The penalty associated with an 
event is defined as the average time that a processor is blocked 
at the occurrence of the event. The average coherence penalty 
per memory reference to S-block i is: 

\[ A_i = M_i A_{M} + I_i A_{IN\_RO} + C_i A_{CS\_RW} + D_i A_{IN\_RW} \]

M_i, I_i, C_i, and D_i were defined in Section 4.1. 

If we neglect the transients, the average coherence penalty 
per memory reference is given by the sum of the coherence 
penalties on each shared writable block i, that is, 

\[ A_{total} = q_s \sum_{i=1}^{N_s} p_i A_i \tag{6} \]


As for the system miss rate, S-blocks can be clustered into a few 
sets in which blocks have the same average coherence penalty. 
The average penalty could also be approximated by the penalty 
for an average block. 
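
A minimal sketch of how the two system measures could be evaluated from per-block (or per-set) results follows. The function names and the dictionary layout are ours; the sums follow our reading of equations (5) and (6), and the MIPS expression follows our reading of the footnote attached to the next paragraph, 1 / (T + r A_total) with T in microseconds.

```python
from fractions import Fraction

def system_measures(q_s, blocks, penalties):
    """blocks: list of dicts with keys p, M, I, C, D (one per S-block or set).
    penalties: dict with keys M, IN_RO, CS_RW, IN_RW.
    Returns (system miss ratio, average coherence penalty per reference).
    """
    miss = q_s * sum(b["p"] * b["M"] for b in blocks)                  # eq. (5)
    penalty = q_s * sum(
        b["p"] * (b["M"] * penalties["M"] + b["I"] * penalties["IN_RO"]
                  + b["C"] * penalties["CS_RW"] + b["D"] * penalties["IN_RW"])
        for b in blocks)                                               # eq. (6)
    return miss, penalty

def mips_rate(T, r, penalty):
    """T: uniprocessor instruction time (microseconds); r: memory references
    per instruction; penalty: average coherence penalty per reference."""
    return 1 / (T + r * penalty)
```

For example, with q_s = 1/10, two equally weighted blocks having M = 1/6, I = 1/8, C = 1/8, D = 1/24, and penalties of 1 for every event except 0.5 for IN_RO, the miss ratio is 1/60 and the average penalty is 19/480.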

The average coherence penalty adversely affects the processor 
efficiency. In powerful and expensive mainframe multiprocessors, 
any loss of processor efficiency is critical for the 
performance/cost ratio of the system.² 


5. APPLICATION TO MULTITASKED ALGO- 
RITHMS (CACHE BLOCK SIZE IS ONE) 


If the cache block size is one data element, then the values 
of the parameters are straightforward to obtain. This section deals 
with a cache block size of one. In Section 6, we will investigate the 
block size effect. 

We compare the model predictions with the simulation of 
specific algorithms running on shared-memory multiprocessors 
in which each processor has a private data cache of infinite 
size. The behavior of the relaxation and FFT algorithms is 
data independent. To simulate these algorithms, a simulation 
methodology described in [8] was applied. In this methodology, 
the algorithm is actually executed on a uniprocessor and the 
multiprocessing effect is obtained by executing the program of 
each simulated processor in turn. The simulator switches from 
one simulated processor to the next on each shared data access 
and synchronization primitive execution. A slightly different 
technique was applied to the quicksort algorithm and will be 
explained in Section 5.2. Many simulation results can be derived 
analytically. These analytical derivations, whenever they 
are possible, are presented in [9]. The computation of the 
model parameters for each case is also given in [9]. 


5.1 Relaxation Algorithms for Partial Differential 
Equations 


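We consider two iterative schemes [20]: the Jacobi and the 
Successive Over Relaxation (S.O.R.) algorithms. In the Jacobi 
iterative algorithm, the computation consists in repetitively 
updating each point of a grid as follows: 

\[ u_{i,j}^{(K+1)} = \frac{1}{4} \left( u_{i-1,j}^{(K)} + u_{i+1,j}^{(K)} + u_{i,j-1}^{(K)} + u_{i,j+1}^{(K)} \right) \]
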
The Jacobi iterative algorithm requires maintaining two 
grids. In each iteration, each point of one grid is updated by 
using the values of the 4 neighbors in the other grid. Then the 
processors synchronize and the two grids are interchanged. For 
an M x N grid, there are 2(M+N) boundary grid points and 
these points are not modified during execution: these points 
are Read-Only and can be treated as P-blocks. The grid points 
adjacent to these boundary points are called outer grid points. 
The reference pattern to outer grid points is different from the 
pattern to inner grid points. Consider, for example, two square 
grids of size 8 x 8 for which we allocate one processor to each 
subgrid of size 4 x 4, as displayed in Figure 4. The sets of 
shared writable data are circled in Figure 4. There are only 
three sets of S-blocks (in the sense of Section 4.1): (1) inner 
grid points with J = 2, (2) outer grid points with J = 2, and 
(3) inner grid points with J = 3. 


² If T is the mean execution time of an instruction in the uniprocessor 
system (in microseconds), and if r is the average number of memory 
references per executed instruction, the average instruction execution time in 
the multiprocessor is T + r A_total, and the MIPS rate (Millions of 
Instructions Per Second) per processor is MIPS rate = 1 / (T + r A_total). 

Shared writable data are accessed in semi-critical sections 
in the Jacobi iterative algorithm. In one iteration, the shared 
data in one grid are Read-Only and are accessed by different 
processors, and in the next iteration they are modified by a 
single processor in a critical section phase. Since the models 
developed in this paper are for infinite caches in steady-state, 
we consider iterations n and n+1 where n > 1, and we assume 
that the data caches are large enough to contain the two subgrids 
accessed by each processor. 
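
The two-grid structure that produces this alternation between a Read-Only phase and an update phase can be sketched as follows. This is a sequential illustration only, with hypothetical function names; it assumes the plain four-neighbor average as the update rule, and in a parallel run each processor would update its own subgrid and synchronize before the grids are interchanged.

```python
def jacobi_iteration(old, new):
    """Fill the interior of `new` from `old`; boundary rows/columns are fixed."""
    rows, cols = len(old), len(old[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # `old` is only read in this iteration; `new` is only written.
            new[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j]
                                + old[i][j - 1] + old[i][j + 1])
    return new, old   # the two grids swap roles for the next iteration

def make_grid(size, boundary=1.0):
    """Square grid with the given boundary value and a zero interior."""
    return [[boundary if i in (0, size - 1) or j in (0, size - 1) else 0.0
             for j in range(size)] for i in range(size)]
```

Starting from two 4 x 4 grids with boundary value 1 and a zero interior, `a, b = jacobi_iteration(a, b)` leaves 0.5 at every interior point of `a`; a second call raises these values to 0.75.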

In the S.O.R. algorithm, only one copy of the grid is needed 
and iterates are updated according to the red/black ordering: 
grid elements which are in even positions (the sum of the indexes 
is even) are tagged as black, others as red. Each iteration 
proceeds in two sweeps. The red elements are updated in the 
first sweep, and the black elements are updated in the second 
sweep. After each sweep, each processor has to synchronize 
with at most 4 neighbors. Each processor has the same number 
of red and of black iterates. The equation for the update 
of an iterate in the (K+1)th iteration is: 

where ω is the relaxation factor. 

During one sweep of the algorithm, some shared grid points 
are read by multiple processors, and some others are read and 
modified by one processor. As for the Jacobi iterative algo- 
rithm, there are 3 sets of shared writable grid points. 
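
The red/black sweeps can be sketched in the same style. The update rule used here is the textbook S.O.R. form for the five-point stencil (a weighted combination of the old value and the neighbor average); it is an assumption on our part, since the update equation is not legible in this copy, and the function names are hypothetical.

```python
def sor_sweep(grid, omega, parity):
    """Update in place the interior points whose index sum has this parity."""
    rows, cols = len(grid), len(grid[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if (i + j) % 2 == parity:
                # All four neighbours have the opposite colour, so they are
                # only read during this sweep.
                average = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                  + grid[i][j - 1] + grid[i][j + 1])
                grid[i][j] = (1 - omega) * grid[i][j] + omega * average

def sor_iteration(grid, omega):
    """One iteration: red sweep, then black sweep, on a single grid."""
    sor_sweep(grid, omega, parity=1)   # red points (odd index sum) first
    # In a parallel run, processors would synchronize with neighbours here.
    sor_sweep(grid, omega, parity=0)   # then black points (even index sum)
    return grid
```

With a relaxation factor of 1 on a 4 x 4 grid whose boundary is 1 and interior is 0, one iteration sets the two red interior points to 0.5 and the two black interior points to 0.75.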


5.2 Quicksort 


Quicksort is a divide-and-conquer algorithm, which sorts 
a file A[1], A[2], ..., A[N] by rearranging it so that the 
condition A[1], ..., A[j-1] ≤ A[j] ≤ A[j+1], ..., A[N] holds 
for some j, and by recursively applying the same procedure to 
the subfiles A[1], ..., A[j-1] and A[j+1], ..., A[N]. A program 
for the quicksort on a uniprocessor is given in Figure 5 [15]. 


In a multiprocessor, at the end of each splitting phase, one 
subfile is processed by the same processor and the other subfile 
is sent to a different processor, until all processors are busy. 
The computation proceeds in a tree-like fashion with as many 
leaf nodes as there are processors, as displayed in Figure 6. At 
the leaf of the computation tree, each processor is assigned the 
quicksorting of one subfile. 

In an infinite cache environment, coherence activity occurs 
mostly while the tree is growing. We only consider the part of 
the execution from the start of the algorithm until a processor 
has reached the bottom of the tree and has finished the first 
iteration of its local quicksort. While a processor splits a subfile, 
no other processor accesses any data item in the subfile. 
Therefore shared data are accessed in critical sections. The P 
subfiles obtained at the leaves of the tree (P is the total number 
of processors) correspond to P different paths in the computation 
tree; the data in these subfiles are shared to various 
degrees. For example, one subfile is accessed by the same processor 
from start to finish, and therefore is not shared. Other 
subfiles may be shared by J processors, J = 2,...,log_2 P + 1. 
P - 1 sets of shared writable data can be identified. Each set 
can be associated with a leaf in the computation tree. Figure 
6 illustrates the different sets for P = 8. 

Let us assume that the probability of an exchange is q during 
each splitting phase. The values of the parameters for the 
model are l = 1 + q, W = q, and J = 2,...,log_2 P + 1. The 
exact values for the miss rate and for the average penalty can 
be computed and verified by simulation. The simulation of the 
quicksort proceeds as follows. We take a file of size N and 
scan it repetitively as in the quicksort algorithm. However, we 
do not execute the quicksort, because its behavior is data dependent. 
Rather, in the simulation, shared data accesses are 
generated as follows. Every time an element of a subfile is 
visited, we decide to exchange with probability q; a subfile is 
always split into two equal halves. 
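
The access generation just described might be sketched as follows. This is a hypothetical reconstruction, not the authors' simulator: the function names, the processor numbering and the depth-first ordering of the generated trace are our own choices (the order of accesses to any given element is preserved, which is what matters for a per-block analysis).

```python
import random

def ideal_quicksort_trace(n, num_procs, q, seed=0):
    """Return a list of (processor, index, 'R' | 'W') shared-data accesses.

    n and num_procs are powers of two with num_procs <= n (ideal quicksort).
    """
    rng = random.Random(seed)
    trace = []

    def split(proc, lo, hi, width):
        # One scan of the subfile; each visited element is exchanged
        # (written) with probability q.
        for idx in range(lo, hi):
            trace.append((proc, idx, "R"))
            if rng.random() < q:
                trace.append((proc, idx, "W"))
        if width > 1:
            mid = (lo + hi) // 2
            split(proc, lo, mid, width // 2)                # kept locally
            split(proc + width // 2, mid, hi, width // 2)   # sent elsewhere

    split(0, 0, n, num_procs)
    return trace

def sharing_degree(trace, n):
    """Number of distinct processors that touched each element."""
    sharers = [set() for _ in range(n)]
    for proc, idx, _ in trace:
        sharers[idx].add(proc)
    return [len(s) for s in sharers]
```

For N = 64 and P = 8 the sharing degrees produced by this sketch range from 1 (the subfile that stays on the first processor) to 4 = log_2 P + 1, in line with the range of J quoted above.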


5.3 Fast Fourier Transform (FFT) 


The one-dimensional non-shuffling FFT algorithm for N 
data items is represented by a butterfly graph with log_2 N 
stages. A bit-reversal permutation is applied at some point of 
the algorithm, so that the results are stored in the same order 
as the initial data items. Let s(k), k = 0,1,2,...,N-1, be N samples 
of a time function. The DFT (Discrete Fourier Transform) of 
s(k) is defined to be the discrete function x(j), j = 0,1,2,...,N-1, 
where 

\[ x(j) = \sum_{k=0}^{N-1} s(k) \, e^{-2 \pi i j k / N} \]

where j = 0,1,...,N-1 and i = √(-1). 

In the non-shuffling FFT algorithm, we divide the array of 
N items into P chunks containing N/P consecutive items. Each 
processor computes the FFT for its chunk, containing N/P data 
items. For N=16 and P=4, the non-shuffling FFT algorithm 
is illustrated in Figure 7. In general, each block is shared by 
log_2 P + 1 processors and the algorithm can be divided into two 
parts. In the first part, i.e., the first log_2 (N/P) stages of the 
butterfly, every shared block is accessed by one processor. In the 
second part, i.e., in each of the last log_2 P stages of the butterfly, 
each shared block is first read by two processors and then 
modified by a single processor in a critical section. Since there 
is no coherence activity in the first part of the non-shuffling 
FFT algorithm, we examine the second part of the algorithm. 
Synchronization is necessary in this algorithm and is denoted 
by dotted lines in Figure 7. In general, 2 log_2 P synchronization 
points are needed in the second part. (If the algorithm used 
two copies of the array and alternated between the copies, only 
log_2 P synchronization points would be needed.) There is only 
one set of shared writable data in this problem. 
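
The split into a local part and a shared part can be illustrated by listing, for each butterfly stage, which processors own the partners of a given processor's data items. The sketch assumes contiguous chunks of N/P items and stages ordered by increasing butterfly distance (1, 2, 4, ...); Figure 7 may order the stages differently, and the function name is ours.

```python
def stage_partners(n, p):
    """For each butterfly stage, map each processor to the set of OTHER
    processors owning the butterfly partners of its data items."""
    chunk = n // p                       # items per processor
    stages = n.bit_length() - 1          # log2(n) when n is a power of two
    result = []
    for s in range(stages):
        partners = {proc: set() for proc in range(p)}
        for k in range(n):
            # Item k is combined with item k XOR 2^s at stage s.
            owner, other = k // chunk, (k ^ (1 << s)) // chunk
            if owner != other:
                partners[owner].add(other)
        result.append(partners)
    return result
```

For N = 16 and P = 4, the first two stages involve no other processor, while in each of the last two stages every processor exchanges data with exactly one other processor (0 with 1 and 2 with 3, then 0 with 2 and 1 with 3).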


Another algorithm for FFT in multiprocessors is the shuffling 
FFT. In this algorithm, computations of partial FFTs 
alternate with shuffling stages in which data are passed among 
processors. Figure 8 presents the shuffling FFT algorithm for 
an example where N=16 and P=4. During each butterfly computation 
and each shuffling stage, each shared block is read and 
updated by a single processor. There is only one set of shared 
writable data and each data item is shared by J = 2 processors. 


5.4 System Effects 


Table 1 records the overall miss rate on data and the average 
coherence penalty for the five algorithms. The unit for 
the penalties is the average penalty for a miss. We have chosen 
the following penalties for each event: A_M = A_CS_RW = 
A_IN_RW = 1, A_IN_RO = 0.5. There are three numbers per 
entry in the Table. The first one is obtained by applying 
the model to each set and by summing the contributions of each 
set. The second number is obtained by computing the average 
values of the model parameters and by applying the model 
with these average values. These two numbers are very close 
because there is only one set of data, or one set dominates, 
or all sets are accessed with the same probability. The third 
number is obtained by simulation. Remember that these performance 
estimates apply to the part of the algorithm where 
sharing of writable data occurs, and that the first miss and its 
associated penalty are not counted. 


6. EFFECT OF CACHE BLOCK SIZE 


Cache block size is an important factor that affects the sys- 
tem performance. When a cache block contains more than one 
datum, access bursts to an S-block have different characteristics 
and in general, it is much more difficult to apply the models. 

In this section, we show how the cache block size, B, affects 
the model and simulation results. 

We chose the following penalties for each coherence 
event: A_M = A_CS_RW = A_IN_RW = 0.75 + 0.25B and 
A_IN_RO = 0.5. We therefore model the penalty for a block 
transfer by a simple linear function of the block size. 


6.1 Relaxation Algorithms for Partial Differential 
Equations 


In the two iterative algorithms, when the cache block size 
increases, the number of sets of S-blocks also increases. For 
instance, there are five sets of S-blocks when B is two, eight 
sets of S-blocks when B is four, and ten sets of S-blocks when B 
is eight. For the Jacobi iterative algorithm, Figure 9 illustrates 
the eight different types of S-blocks, named type 1 to type 8, 
when the cache block size is four data elements. 

In the case of an M x M Jacobi array, when the cache block 
size exceeds M/√P + 1, processors update S-blocks alternately, 
and hence the number of references l in each burst is equal to 
one; J can be as high as 2·√P or even P. 

We have identified all sets and we have applied the model to 
each set for block sizes from 1 to 256, and for a grid size of 128 
x 128. Figures 10 and 11 show the comparison between model 
prediction (dotted curve) and simulation (plain) for the system 
miss ratio and the system penalty (obtained by summing the 
contributions of all sets). These curves are valid for any number 
of processors P provided B ≤ M/√P + 1. These two Figures show 
that the analytical program model yields very good predictions 
for the Jacobi iterative algorithm when the cache block size is 
greater than two (error is less than 2%). 


In the S.O.R. algorithm, for any cache block size, the number 
of sets of S-blocks is the same as in the Jacobi iterative 
algorithm; however, the reference patterns to the blocks in the 
sets are different. 

In the case of an M x M S.O.R. array, when the cache 
block size exceeds M/√P + 1, S-blocks are updated alternately 
by different processors. In this case, J can be as high as 2·√P 
or even P. 

Figures 12 and 13 illustrate the results of the system miss 
ratios and system penalties for different cache block sizes for 
a 128 x 128 grid. These curves are independent of the number 
of processors provided B ≤ M/√P + 1. The model (dotted) is 
compared to the simulation (plain). Again, the Figures show 
that the model is very reliable. 


6.2 Quicksort 


For simplicity we have considered only the ideal quicksort: 
the number of processors and of elements in the file is a power of 
2 and each split is perfect, i.e. a subfile of size n is exactly split 
in two subfiles of size n/2. When the number of data elements in 
one cache block is less than or equal to N/P, where N is the total 
number of data elements, a cache block can only be referenced 
by one processor in a splitting phase. 

Figures 14 and 15 show the results of the system miss ratio 
and the system penalty for the model (dotted) and for the sim- 
ulation (plain). In these simulations, the file size was N=64K, 
and the number of processors was 8. Different curves would be 
obtained for different numbers of processors. The relative error 
in these two Figures is less than 20%. 


6.3 Fast Fourier Transform (FFT) 


In the non-shuffling FFT algorithm, there is only one set of 
S-blocks. When the number of data elements in a cache block 
is less than or equal to N/P, each S-block is shared by log_2 P + 1 
processors. 

When the number of data elements in one cache block exceeds 
N/P, S-blocks bounce back and forth among processors at 
every Write. In this case, each block is shared by more than 
log_2 P + 1 processors. 

In the shuffling FFT algorithm, there is also only one set 
of S-blocks. When the number of data elements in a cache 
block is less than or equal to N/P, each S-block is shared by two 
processors. Otherwise, when the number of data elements in 
one cache block is larger than N/P, S-blocks move back and forth 
among processors. 

Figures 16-19 show the results for non-shuffling and shuffling 
FFT algorithms. The file size is 64K and the number of 
processors is four. These curves would be different but would 
have the same shape for larger numbers of processors. The 
relative errors between model predictions and simulations are 
between 20% and 30% for the system miss ratio, and between 
5% and 20% for the system penalty. 


7. DISCUSSION OF RESULTS 


It has been observed that the combined effects of critical 
sections (for all block sizes) and of the spatial locality [16] of 
accesses (for block sizes larger than one) to shared writable 
data result in access bursts to such data by different proces- 
sors. This is the basic premise of the paper. Based on this 
observation, we have extended a previous program model for 
the sharing of data, and we have tried to match the model pre- 
dictions and the predictions of simple simulations of algorithms 
in multiprocessors with infinite caches. 

It appears that iterative algorithms such as the Jacobi or 
S.O.R. are very well suited to cache-based systems with large 
data caches, because shared data contention is low (in realistic 
cases, the number of processors sharing a given writable block 
is less than four and the fraction of accesses to shared writable 
data is low). Figures 10-13 show that bigger block sizes do 
not improve the overall hit rate on shared data and cause more 
penalty: the average miss rate on each S-block access decreases 
(i.e. M_i decreases) but the number of accesses to such blocks 
increases (i.e. q_s increases); the probability of a coherence event 
per access to S-blocks decreases, but this is more than compensated 
by the increase of q_s and of the penalty associated with 
each coherence event. For shared data accesses, the block size 
should be small (one or two data elements). Note that this 
conclusion is only valid if the caches are large enough to contain 
all data across successive iterations; moreover, the first 
iteration causes a large number of misses for the initial load of 
data and instructions. These transients are helped by a bigger 
block size. From the Figures, we observe that a block size of 
16 data elements is acceptable for shared data accesses. These 
conclusions are valid for many configurations of processors as 
explained in Section 6.1. 

The results on the quicksort are somewhat artificial because 
we have simulated the ideal quicksort only, to simplify. In practice, 
one would have to run simulations for multiple random 
input files and take averages. Nonetheless, the stochastic sim- 
ulation does represent one possible execution of the quicksort. 
Bigger block sizes are a definite advantage in the ideal quick- 
sort algorithm, up to a size of B = 16. If the algorithm used to 
estimate the median is good, this conclusion should hold also 
in the general case (we expect however more contention). 

Bigger block sizes also improve the performance of the FFT 
routines (strangely enough up to a block size of 16 data el- 
ements, again). While the penalties on individual coherence 
events increase with the block size, the fraction of shared-data 
accesses causing these events decreases and the total number 
of accesses stays constant, as the block size increases. 

In all the simulations we ran, there is a maximum block size 
beyond which the performance drops sharply. This block size 
depends on the size of the problem and on the decomposition 
of the algorithm (i.e. the number of processors). 

From Tables 1-6 and from Figures 10-18, it appears that 
for the five algorithms studied in this paper, the precision of 
the model based on the idea of access bursts is good in many 
cases. We never expected the models to fit exactly each case: 
because of the large data reduction in the stochastic models, 
a given model with given parameter values maps on different 
algorithms with different behaviors. It appears however that 
the models and their parameters are sufficient to approximate 
the shared data contention effect for some important parallel 
algorithms, and in the case of the infinite cache model. 

If we look at the comparisons between model and simula- 
tions, it appears that the quicksort and the non-shuffling FFT 
result in the worst predictions; in the case of the non-shuffling 
FFT, the model predicts that the coherence overhead will increase 
with P, the number of processors, while the simulations 
predict that it remains constant. A closer look at Figure 7 
shows that an S-block is not shared by all log_2 P + 1 processors 
at all times but rather that it is shared by different groups of 
two processors at different stages of the computation. Applying 
the model with J=2 would yield a much improved prediction 
of the model. 

We have assumed all along that preemptions and migra- 
tions of processes were disallowed. Indeed, in all the algorithms 
we have studied, processes are statically scheduled. We mostly 
made this assumption to simplify the analysis of the algorithms. 
However, in many cases, the bursty behavior of accesses would 
be preserved if preemption, migration, and dynamic schedul- 
ing were allowed. The reason is that bursts of accesses are 
short and unlikely to be interrupted by preemption. Migration 
and dynamic scheduling would increase the randomness in the 
selection of the processor to start the next access burst, and 
therefore would alleviate the problem observed in the quicksort 
and non-shuffling FFT. While the parameters W, f and l 
would remain the same as in this paper (assuming that the time 
between preemptions is very large compared to the average burst 
time), the parameter J would have to be different. If migration 
is allowed, then private writable data will behave like shared 
writable data and the model could be applied to private data, 
as well. 


8. CONCLUSIONS 


In this paper, we have presented and solved a simple model 
for the caching of shared writable data in multiprocessor sys- 
tems executing parallel algorithms. The simplicity and gen- 
erality of the results stem from the infinite cache hypothesis. 
The infinite cache model is independent of all cache parameters 
(e.g., organization or replacement policy). 

There are many extensions possible to this work. First of 
all, one could analyze the behavior of more parallel algorithms 
and compare it to the model predictions. One could investi- 
gate the effect of migration, preemption or dynamic scheduling. 
Some parameters in the model, such as l, are easier to estimate 
directly than others [9], such as J when migration and 
preemption are allowed. Besides using the results of simulations or 
measurements to estimate these parameters, one can use the 
model to derive upper bounds (for example, if J=P or if W=1, 
then the miss ratio obtained through equation (4) is an upper 
bound). Finite-cache effects should also be studied. 

Finally, given the simplicity of the program behavior mod- 
els, one can derive simple results for proposed coherence proto- 
cols in order to compare their effectiveness in handling shared 
writable data [19]. 
Figure 2: (a) Access pattern to a shared writable datum protected by critical sections. (b) Access pattern to a shared 
writable datum protected by semi-critical sections. (r_j: Read access to shared datum X by processor j, w_j: Write access 
to shared datum X by processor j.) 


Figure 1: State diagram for a given block in cache i (infinite 
cache assumption). (R(i): Read block by processor i, R(j): 
Read block by processor j, W(i): Write to block by processor 
i, W(j): Write to block by processor j.) 
Figure 4: The three data sets in the Jacobi iteration. 
Figure 3: (a) Markov chain for the state transitions of an S-block shared by J 
processors (including transient states). (b) Markov chain for the state transitions of an 
S-block shared by J processors (without transient states). 
Figure 7: Non-shuffling FFT 
algorithm for P=4 and N=16. 
procedure quicksort (l,r : integer); 
{ Sorts a[l..r] in place; a is a global array. The scan on j assumes } 
{ that a[l-1] holds a key no larger than any key in a[l..r].         } 
var n,t,i,j : integer; 
begin 
  if r > l then 
  begin 
    n:=a[r]; i:=l-1; j:=r;            { n is the partitioning element } 
    repeat 
      repeat i:=i+1 until a[i] >= n; 
      repeat j:=j-1 until a[j] <= n; 
      t:=a[i]; a[i]:=a[j]; a[j]:=t; 
    until j <= i; 
    a[j]:=a[i]; a[i]:=a[r]; a[r]:=t;  { undo last swap, place n at i } 
    quicksort(l,i-1); 
    quicksort(i+1,r) 
  end 
end; 


Figure 5: Program for the uniprocessor quicksort. 
Figure 8: Shuffling FFT algorithm 
Table 1: System Effects (block size of one) 


(1) Model using average values of the parameters 
(2) Sum of the contributions of each set (model) 
(3) Sum of the contributions of each set (simulation) 
In the following table, the parameters J, W, l and f are average values. 
Figure 9: The eight sets of S-blocks in the Jacobi iteration 
when B is equal to four (M=16, P=4). 

Figure 10: The system miss ratio for Jacobi iteration algorithm 

Figure 11: The system total penalty for Jacobi iteration algorithm 

Figure 12: The system miss ratio for S.O.R. iteration algorithm 

Figure 13: The system total penalty for S.O.R. iteration algorithm 

Figure 14: The system miss ratio for quicksort algorithm 

Figure 15: The system total penalty for quicksort algorithm 

Figure 16: The system miss ratio for Non-shuffling FFT algorithm 

Figure 18: The system miss ratio for Shuffling FFT algorithm 

Figure 19: The system total penalty for Shuffling FFT algorithm 
Abstract 


The multiprocessor is a powerful medium for con- 
ducting empirical studies of parallel processing techniques 
and architectures. Critical to the success of this approach 
is the nature, detail, and accuracy of the measurements 
acquired to evaluate system behavior. While software 
methods of determining behavior characteristics are flexi- 
ble, they are intrusive, perturbing the operation of the sys- 
tem and yielding measurements of marginal accuracy. This 
paper describes two hardware instruments developed for 
multiprocessor behavior analysis that help circumvent the 
intrusive properties of software measurement techniques. 
One instrument, DLA, measures memory access latencies 
and delays due to contention for shared resources. Another 
instrument, SySM, monitors processor software activity, 
providing execution profiling statistics. This paper discuss- 
es three examples of their use in parallel processing 
research on the Concert Multiprocessor Testbed. 


1. Introduction 


The future of high performance computing will rely on 
current research into parallel processing architectures and 
techniques. Effective tools are needed to explore parallel 
program characteristics and parallel architecture behavior. 
Two major classes of tools currently being used are simula- 
tors and prototype parallel systems. 


Simulators[1][2][3] are widely employed due to 
their flexibility and ease of modification. Unfortunately, 
they suffer from innate slowness, restricting the applica- 
tions that can be run on the simulated target system. 
Another tool is a prototype system on a parallel computer: 
experimental and commercial multiprocessors[4] and 
SIMD[5] machines capable of executing significant parallel 
algorithms in acceptable time. While these systems enable 
programmers to run significant applications and make 
coarse measurements with software, they do not permit 
easy evaluation of behavioral details. Software support for 
detailed measurements is intrusive, perturbing the behavior 
of the parallel system by the act of measurement. 


Software simulation can provide detailed traces of 
any activity within a modeled system, often producing 
enormous amounts of information. Real time instrumentation 
cannot access all of the elements of a system; it does not 
have the time and storage resources available to collect 
exhaustive traces in a nonintrusive manner. Fortunately, 
some experiments only require a simple set of statistics, 
rather than a time domain trace. In these cases, instru- 
ments can be devised that record these statistics instead of 
going through the intermediate step of acquiring a time 
trace. 


Harris Corporation’s Advanced Technology Depart- 
ment and the MIT Laboratory for Computer Science have 
each developed a version of a multiprocessor called Con- 
cert. Concert[6] features embedded instrumentation, allow- 
ing parallel processing experiments to be performed in real 
time with a minimum of intrusion. The nature and functional- 
ity of Concert’s instrumentation and examples of its use in 
parallel processing research will be given. 


SySM and DLA are instruments in Concert which 
measure performance loss parameters. SySM[7] provides 
hardware support for monitoring software behavior in a 
nearly nonintrusive manner. The programmer divides the 
application program into a set of mutually exclusive seg- 
ments. As execution shifts from one segment to another, 
the application informs SySM of the transition. SySM accu- 
mulates the number of entries and the total time spent in 
each segment, and tracks nested interrupts. DLA[8] pro- 
vides nonintrusive measurements of the time lost due to 
memory access latency and contention for shared communi- 
cation channels. It determines the amount of time each pro- 
cessor spends waiting for access to shared buses, the 
amount of time it spends accessing hierarchical memory, 
and the number of times atomic test-and-set operations are 
performed. SySM and DLA provide a powerful mechanism 
for observing those elements of a multiprocessor that result 
in performance loss. 


Figure 1. A Typical Concert Cluster 


Three different research projects have utilized Con- 
cert, SySM and DLA: the SPoC[9] parallel execution 
environment, an interpreter for the Multilisp[10] language, and 
an emulator of the YARC static dataflow architecture[11]. 
These experiments illustrate how SySM and DLA can be 
used to reveal the behavior of both the application software 
and the multiprocessor hardware. The disparate functionali- 
ty of the tools is used to advantage, as each application 
uses the tools in a different manner. 


2. Background 


The development of SySM and DLA was based 
upon the needs of the researchers investigating parallel per- 
formance degradation, and was shaped by the architecture 
of the Concert Multiprocessor. The details of Concert, and the 
approach used to analyze degradation, played a significant 
role in the design and subsequent use of SySM and DLA. 


2.1. The Concert Multiprocessor Testbed 


The Concert multiprocessor was developed as a 
flexible facility for empirical research in the field of parallel 
processing. Two versions of Concert have been implement- 
ed: the first at the MIT Laboratory for Computer Science 
and the second at the Advanced Technology Department of 
Harris Corporation’s Government Systems Sector. They 
are logically equivalent, supporting the same software envi- 
ronments and executing a shared base of applications. The 
two Concerts differ in the means by which global memory is 
shared and system wide communication is performed. Con- 
cert has been used to study parallel algorithms, languages, 
computing models, system run time strategies, and parallel 
computer architecture. The Concert environment consists 
of the multiprocessor hardware, a message passing library 
for parallel program development, and a set of utilities 
supporting local area network communication and a disk file 
system. 


Concert is a tightly coupled shared memory multipro- 
cessor. It incorporates up to 64 MC68000[12] microproces- 
sors, organized in eight clusters of up to eight processors 
each. Each processor has 512 Kbytes of local memory; the 
system also has 8 Mbytes of global memory. In addition, a 
set of globally accessible registers provides interprocessor 
interrupts. The eight clusters are connected with the global 
memory and registers by one of two communication 
mechanisms. 


The cluster is the basic unit of the Concert multipro- 
cessor. It contains up to eight processors, each with its 
own local memory connected via a private high speed bus. 
Some clusters include disk controllers for secondary stor- 
age and Ethernet interface boards. All boards within the 
cluster use a common Multibus. Each cluster contains a 
global memory interface board. Each cluster also contains 
DLA and SySM instrumentation hardware. A typical Con- 
cert cluster is shown in Figure 1. 


The MIT Concert uses a dynamically segmented 
RingBus to interconnect eight clusters of four processors 
each. Each cluster holds 1 Mbyte of the 8 Mbyte global 
memory. Processors accessing global memory within their 
cluster use the cluster’s internal bus; memory in other clus- 
ters is accessed via the RingBus. A central arbiter pro- 
cesses requests for RingBus access from the clusters, 
establishing non-overlapping paths between the requesting 
clusters and the desired global memory segments. A clus- 
ter is attached to the RingBus by means of the RIB 
(RingBus Interface Board) which also contains the relevant 
subset of the system global registers. The system level 
architecture of the MIT Concert multiprocessor is shown in 
Figure 2a. 


The Harris Concert employs a conventional cross- 
bar switch to connect its eight clusters of eight processors 
each to the system’s global memory and registers. The 
memory is organized in a 16-way interleaved structure of 
512 Kbyte blocks. This interleaving reduces memory 
contention by distributing memory references across the 
blocks. An additional block contains the global system 
registers. The architecture of the Harris Concert multiproces- 
sor is shown in Figure 2b. 
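
The effect of interleaving on reference distribution can be illustrated with a small sketch. The block count and block size come from the text; the interleave granularity is not stated in the paper, so GRANULE_BYTES below is purely an assumption, and block_of and blocks_touched are hypothetical helper names. 

```python
# Sketch of a 16-way interleaved global memory address mapping.
NUM_BLOCKS = 16            # from the text: 16-way interleaved
BLOCK_BYTES = 512 * 1024   # from the text: 512 Kbyte blocks
GRANULE_BYTES = 2          # ASSUMPTION: interleave on 16-bit words


def block_of(addr):
    """Map a global byte address to (block index, offset within block)."""
    if not 0 <= addr < NUM_BLOCKS * BLOCK_BYTES:
        raise ValueError("address outside the global memory")
    granule = addr // GRANULE_BYTES
    block = granule % NUM_BLOCKS
    offset = (granule // NUM_BLOCKS) * GRANULE_BYTES + addr % GRANULE_BYTES
    return block, offset


def blocks_touched(start, count):
    """Blocks hit by `count` consecutive granule-sized references."""
    return [block_of(start + i * GRANULE_BYTES)[0] for i in range(count)]


if __name__ == "__main__":
    # Sequential references rotate through all sixteen blocks.
    print(blocks_touched(0, 16))
```

Under this assumed mapping, a processor sweeping through consecutive words never references the same block twice in a row, which is the property the interleaving relies on to reduce contention. 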


2.2. Performance Degradation 


The performance of a multiprocessor is intimately 
coupled to the factors which contribute to performance 
degradation. These factors cause the actual performance of 
a multiprocessor to deviate from the ideal case. By quanti- 
fying and reducing these losses, overall performance can be 
improved. There are four general sources of loss which 
must be observed to characterize a multiprocessor’s 
performance. These are: 

Starvation    the time a processor is idle due to inadequate 
              parallelism in the application program. 

Contention    the delay experienced by a processor 
              attempting to obtain exclusive access to 
              a shared resource already in use. Two 
              processors accessing the same queue, 
              for example, must serialize their accesses 
              in order to retain queue integrity. 

Overhead      the work that must be performed by a 
              processor to manage the application’s 
              parallelism. This work would not be performed 
              by a uniprocessor executing the 
              same application. Overhead includes 
              synchronization, task creation, and task 
              scheduling. 

Latency       the time required to access distant memory 
              objects in systems with distributed 
              communication and memory. This 
              includes the impact of cache misses due 
              to frequent context switching. 


Measuring each of these losses permits the characteriza- 
tion of the multiprocessor’s performance model. 
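
As an illustration of how the four measured losses might be combined, the sketch below splits one processor's elapsed time into useful work and the four loss categories. The function name loss_breakdown and the sample figures are hypothetical, and the sketch assumes the four losses do not overlap in time; the paper does not state this. 

```python
def loss_breakdown(total_us, starvation_us, contention_us,
                   overhead_us, latency_us):
    """Return each loss, and the remaining useful work, as a fraction
    of the total elapsed time (all arguments in microseconds)."""
    losses = {
        "starvation": starvation_us,
        "contention": contention_us,
        "overhead": overhead_us,
        "latency": latency_us,
    }
    useful_us = total_us - sum(losses.values())
    if total_us <= 0 or useful_us < 0:
        raise ValueError("losses exceed the total elapsed time")
    report = {name: t / total_us for name, t in losses.items()}
    report["useful"] = useful_us / total_us
    return report


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    for name, frac in loss_breakdown(1000, 100, 50, 150, 200).items():
        print(f"{name:12s}{frac:6.1%}")
```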


3. Embedded Instrumentation 


Hardware support for efficient monitoring of multi- 
processor system behavior has been realized in two embed- 
ded instruments within the Concert Multiprocessor 
testbed. Two classes of behavior are examined: contention 
and latency at the hardware level, and starvation and over- 
head at the software level. 


3.1. DLA: Monitoring Contention and 
Latency 


Performance loss due to hardware operation 
includes contention for access to shared physical resources 
and latency of access to nonlocal objects. Contention in 
Concert occurs when multiple processors require access to 
the same Multibus, RingBus segment, or global memory 
block simultaneously. Losses resulting from latency occur 
when processors must access data in global memory, which 
has a slower cycle time than local memory. Neither of 
these losses occurs in a uniprocessor; such degradation can 
be attributed to parallel processing. The DLA 
(Degradation due to Latency and Arbitration) hardware 
installed in each cluster of Concert measures the time each 
processor spends waiting for and using the Multibus, and 
accessing memory. 


Figure 3. The DLA Block Diagram 


DLA measures several aspects of Multibus activity, 
including: 


Free Time     the time in which the Multibus is idle. 

Wait Time     the time a processor spends contending for 
              use of the Multibus. 

Access Time   the time the processor spends using the 
              bus. 

DLA also counts read and write operations within each block 
of global memory. 


Global memory latency is the global memory cycle 
time, plus global communication arbitration time, plus con- 
tention for an individual global memory block. On Concert, 
the loss resulting from memory block contention can be 
determined by measuring the uncontended cycle time and 
subtracting this from the total memory access time. 
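
The subtraction described above can be sketched as follows. This is a minimal illustration, assuming DLA reports an accumulated access time and an access count per memory block; the function name and the sample numbers are hypothetical. 

```python
def block_contention_loss(total_access_time_us, num_accesses,
                          uncontended_cycle_us):
    """Estimate the time lost to contention for one global memory block.

    total_access_time_us  accumulated access time for the block
    num_accesses          number of accesses counted for the block
    uncontended_cycle_us  access time measured with no contention
    """
    baseline_us = num_accesses * uncontended_cycle_us
    # Measurement noise could make the difference slightly negative.
    return max(0.0, total_access_time_us - baseline_us)


if __name__ == "__main__":
    # Hypothetical figures: 1000 accesses, 1.5 us uncontended, 1800 us total.
    print(block_contention_loss(1800.0, 1000, 1.5))
```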


Another source of loss is contention for shared data 
structures protected by some mutual exclusion discipline, 
such as atomic test-and-set operations. Although this 
loss is not a type of hardware degradation, DLA can be 
used to estimate it. DLA counts the number of successful 
and unsuccessful TAS operations for each processor. The 
ratio of failed to successful TAS operations gives an 
indication of the amount of performance loss caused by contention 
for shared data structures. 
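
A small sketch of this ratio follows. The counter values are hypothetical, and reading the ratio as "extra attempts per acquisition" assumes each lock acquisition ends with exactly one successful TAS, which the paper does not state. 

```python
def tas_failure_ratio(successful, failed):
    """Ratio of failed to successful test-and-set operations."""
    if successful == 0:
        return float("inf") if failed else 0.0
    return failed / successful


def attempts_per_acquisition(successful, failed):
    """Average TAS attempts per acquired lock (ASSUMES one success each)."""
    return 1.0 + tas_failure_ratio(successful, failed)


if __name__ == "__main__":
    # Hypothetical per-processor counters of the kind DLA accumulates.
    print(tas_failure_ratio(400, 100))
    print(attempts_per_acquisition(400, 100))
```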


The functional block diagram for DLA is shown in 
Figure 3. Information is acquired from two external 
sources: the Multibus and its arbitration unit. DLA has 
four parts: free time measurement, bus contention measure- 
ment, bus cycle statistics, and DLA control. These are 
depicted with their external and primary internal connec- 
tions. 


The free time module monitors the busy signal from 
the Multibus and determines the amount of time the bus is 
not in use. The contention module maintains separate 
timers dedicated to each bus master to measure the amount 
of time lost by each processor waiting for the bus. An 
extended measurement range is obtained by incrementing 
counters in RAM when a particular timer overflows. A 
timer is active when its respective processor has requested 
the bus but has not been granted master status by the 
arbiter. Any number of timers can be active simultaneously 
depending on the traffic density. 
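
The contention module's behavior can be modeled in software as a sketch. The timer width is not given in the paper, so width_bits is an assumption, and the class and method names are hypothetical. 

```python
class ExtendedTimer:
    """Fixed-width timer whose overflows are counted in a separate counter,
    modeling the RAM counters that extend the measurement range."""

    def __init__(self, width_bits=16):      # ASSUMPTION: width not stated
        self.modulus = 1 << width_bits
        self.timer = 0
        self.overflows = 0

    def tick(self):
        self.timer += 1
        if self.timer == self.modulus:      # timer wrapped around
            self.timer = 0
            self.overflows += 1

    def value(self):
        return self.overflows * self.modulus + self.timer


class ContentionModule:
    """One timer per bus master; a timer runs while its master has
    requested the bus but has not been granted it."""

    def __init__(self, num_masters=8, width_bits=16):
        self.timers = [ExtendedTimer(width_bits) for _ in range(num_masters)]

    def clock(self, requesting, granted):
        """Advance one clock period. `requesting` is the set of masters
        asserting a request; `granted` is the current bus master or None."""
        for master in requesting:
            if master != granted:
                self.timers[master].tick()


if __name__ == "__main__":
    module = ContentionModule(num_masters=2, width_bits=2)
    for _ in range(10):
        module.clock(requesting={0, 1}, granted=0)
    print(module.timers[1].value(), module.timers[1].overflows)
```

Several timers can advance in the same clock period, matching the statement that any number of timers can be active simultaneously. 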


The majority of measurements come from the bus 
master statistics module. Only one processor is involved 
at a time since only one can be master of the bus at a time. 
The measurements to be updated in the statistics memory 
are determined by the current bus master, the memory 
block accessed, and the type of access. Both the event 
counter and the accumulated time for the specific event type 
are modified. In addition, the TAS detect module senses if 
a compound test-and-set operation is being performed and 
updates the appropriate counter based upon its result. 


The user control and interface module initializes the 
DLA, starts and stops each experiment, and collects the 
results. Measurement intervals can be started and stopped 
independently on a per processor basis. Between succes- 
sive measurement intervals, statistics values can be reset 
or allowed to accumulate. 


Figure 4. The SySM Block Diagram 


3.2. SySM: Monitoring Software Behavior 


SySM (System Software Monitor) is an instrument 
for monitoring the behavior of software running on individual 
processors within a tightly coupled multiprocessor. While 
the DLA can be used with no software modifications, the 
SySM requires a small amount of interaction with the appli- 
cation code. A program is divided into a set of mutually 
exclusive segments. At any point in time, the program is in 
one of these segments. When the program passes into a 
new segment, an instruction included by the programmer 
sends a single word message to SySM identifying the new 
segment being entered. SySM tracks the amount of time 
each processor spends in each segment and the number of 
times each segment is entered. The result is a profile of the 
computation in terms of the user defined segments. 


A SySM board is installed in each cluster of the 
Concert multiprocessor, monitoring the software behavior of 
processors in the cluster. SySM consists of eight banks of 
segment registers, a control module, a time base, and exter- 
nal interfaces, as shown in Figure 4. Each bank of segment 
registers is dedicated to a specific processor in the cluster. 
SySM interfaces to the cluster’s Multibus and the bus 
arbiter. The control module accepts state transition com- 
mands from the Multibus and modifies the appropriate reg- 
ister bank accordingly, using the current time base value. 


A bank of segment registers is dedicated to each 
processor. Each bank is divided into 255 segments, each 
segment containing three registers. These registers 
include number of entries, time in segment, and number of 
interrupts. The number of entries register indicates the 
number of times the segment was entered. The time in seg- 
ment register is the total time spent in the segment by the 
processor. The number of interrupts register shows the 
number of times the processor was interrupted while in that 
program segment. 


There are four additional registers in each bank. 
These are the current segment register, the time entered 
register, the number of nested interrupts register, and the 
time in interrupt register. The current segment register 
indicates the program segment that is being executed. The 
time entered register contains the value of the time base at 
the time of the most recent segment change. The number of 
nested interrupts register indicates the current interrupt 
nesting level. The time in interrupt register indicates the 
total time spent servicing interrupts. 


Unlike other profiling methods that produce stochas- 
tic measurements[13], the data obtained from SySM is 
deterministic. SySM permits this data to be collected in an 
almost non-intrusive manner, causing little perturbation to 
system behavior. SySM provides measurements with one 
microsecond resolution; the intrusion of each SySM access 
is 3.2 microseconds. Segments as small as 20 microseconds 
can be measured accurately. The user defines and can easi- 
ly alter the definition of segments in order to narrow down 
the source of a particular behavior characteristic. This is 
useful in determining the amount of overhead incurred by 
the executing program. Overhead can be isolated from use- 
ful work in order to measure its impact on overall perfor- 
mance. 
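
The bookkeeping performed for one processor can be sketched as a software model of a register bank. This is an illustration only: interrupt tracking is omitted, the class and method names are hypothetical, and the real instrument performs these updates in hardware when it receives the single word message. 

```python
class SySMBank:
    """Software model of one bank of SySM segment registers."""

    NUM_SEGMENTS = 255                      # from the text

    def __init__(self):
        self.entries = [0] * self.NUM_SEGMENTS          # number of entries
        self.time_in_segment = [0] * self.NUM_SEGMENTS  # microseconds
        self.current_segment = None                     # current segment register
        self.time_entered = 0                           # time entered register

    def _close_current(self, now_us):
        if self.current_segment is not None:
            elapsed = now_us - self.time_entered
            self.time_in_segment[self.current_segment] += elapsed

    def enter(self, segment, now_us):
        """Handle a segment transition message at time base value now_us."""
        self._close_current(now_us)
        self.current_segment = segment
        self.time_entered = now_us
        self.entries[segment] += 1

    def stop(self, now_us):
        """End the measurement interval, charging the open segment."""
        self._close_current(now_us)
        self.current_segment = None


if __name__ == "__main__":
    bank = SySMBank()
    bank.enter(1, 0)      # e.g. a hypothetical "awaiting work" segment
    bank.enter(2, 100)    # e.g. a hypothetical "executing" segment
    bank.enter(1, 130)
    bank.stop(150)
    print(bank.entries[1], bank.time_in_segment[1], bank.time_in_segment[2])
```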


4. Examples In Parallel Processing 


SySM and DLA have been used in a variety of paral- 
lel applications. These applications range from implementa- 
tions of high-level parallel applications to a low-level simu- 
lation of a static data flow architecture. The differences in 
these applications serve to illustrate the range of applicabil- 
ity the SySM and DLA provide. 


4.1. SPoC 


The SPoC project is the implementation of the 
Simultaneous Pascal programming language[14][15] on the Con- 
cert Multiprocessor. The project encompasses the design 
and implementation of several significant software compo- 
nents, including compilers and language tools, user inter- 
face software for the Concert host, and machine dependent 
runtime support software which runs on the Concert hard- 
ware. Programs written in Simultaneous Pascal are com- 
piled on the host machine, downloaded to Concert, and exe- 
cuted. Statistics are gathered and displayed by the runtime 
software. 


Simultaneous Pascal is an extension of standard 
Pascal[16], providing the programmer with a set of explicit 
parallel control structures. These parallel statements 
include forall, allowing homogeneous parallelism, fork, 
allowing heterogeneous parallelism, and traverse, allow- 
ing parallel access to dynamic data structures. In addition, 
fine grained scoping and parallel expression evaluation are 
supported. The threads created by these parallel 
constructs are dynamically scheduled by the underlying runtime 
support software. 


The compiler translates Simultaneous Pascal pro- 
grams into MC68000 object code, intermixed with calls to 
library routines which create, schedule, synchronize, and 
destroy threads. These library routines are the heart of the 
SPoC runtime system, and map the virtual Simultaneous 
Pascal machine onto the physical Concert multiprocessor 
hardware. The effective performance and scalability of 
Simultaneous Pascal applications depend upon the effi- 
cient implementation of these routines. 


SySM was instrumental in analyzing and tuning the 
SPoC runtime library. A goal of the SPoC development team 
was to reduce the overhead per thread to under 100 
microseconds. Using SySM, the time spent in each portion 
of the runtime library was determined. In some cases, a 
routine was then subdivided into smaller parts, as short as 
ten microseconds, and SySM was used to measure the time 
each processor spent in each routine segment. The devel- 
opers could then focus on those segments which were exe- 
cuted most often, and which would yield the biggest payoff 
in performance tuning. The average segment times for the 
runtime package shown in Figure 5 indicate that this tuning 
operation was a success. The two critical routines, task 
fetch and do join, together execute in just 88 microsec- 
onds. 
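
The 88 microsecond figure can be checked against the entry counts and accumulated times listed in Figure 5. The sketch below recomputes the per-entry averages; the dictionary and function names are illustrative, while the numbers are copied from the figure. 

```python
# (number of entries, accumulated time in microseconds) from Figure 5
FIGURE5_SYSTEM = {
    "Awaiting Work": (358, 11257),
    "Task Fetch": (358, 11677),
    "Executing": (372, 781204),
    "Do Forall": (7, 30560),
    "Do Join": (364, 20242),
}


def average_us(state):
    """Average time per entry into a state, in microseconds."""
    entries, total_us = FIGURE5_SYSTEM[state]
    return total_us / entries


def critical_path_us():
    """Combined average cost of the two critical runtime routines."""
    return average_us("Task Fetch") + average_us("Do Join")


if __name__ == "__main__":
    for state in FIGURE5_SYSTEM:
        print(f"{state:15s}{average_us(state):10.2f}")
    print(f"task fetch + do join: {critical_path_us():.2f} microseconds")
```

The combined average falls just above 88 microseconds, consistent with the stated goal of under 100 microseconds per thread. 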


Application programmers can take advantage of 
SySM instrumentation via compiler directives.[17] The 
compiler generates instructions which access SySM during 
execution, and the runtime software gathers and displays 
the desired statistics when the application has completed. 
Such profiling provides detailed insight into how parallel 
applications perform, allowing application programmers to 
evaluate algorithms without worrying about the machine 
level details of SySM access. For example, a parallel appli- 
cation involving digital image processing was thought to be 
constrained by serialized access to global memory, and 
SySM was used to time the portions of the application 
which accessed the image data. A second version of the 
application was coded, exploiting locality by copying por- 
tions of the image into memory local to each processor. 
SySM data showed that the global version of the applica- 
tion ran faster, since the cost of copying the image exceed- 
ed the time saved by the local accesses. Without SySM 
instrumentation to analyze individual statement execution 
times, such insight could only be guessed at, rather than 
determined empirically.[18] The execution times of a typi- 
cal instrumented application are shown in the second por- 
tion of the table in Figure 5. In particular, this data shows 
how programmer defined segments appear within the dis- 
play of SySM statistics. 


The SPoC system also uses SySM to derive 
computing profiles of executing applications.[19] A computing pro- 
file allows the programmer to determine how many proces- 
sors are active at any point during execution, and to 
determine what each processor is doing at any time. In 
order to derive a computing profile, each processor records 
the time (provided by SySM) at which each state transition occurs. 
After execution, the runtime software collects this data and 
passes it to the Concert host machine. Subsequent pro- 
cessing by host based tools yields the computing profile 
and processor activity chart shown in Figure 6. The com- 
puting profile relates processor utilization to time, and the 
activity graph uses various shadings to indicate the state 
of each processor at each moment. These graphs are invalu- 
able for analyzing the performance and behavior of parallel 
applications, and allow the programmer to see which parts 
of the application are serialized, reducing effective scalabili- 
ty. 


Processor 07: (all times in microseconds) 

State             Entered       Time    Average   Percent 
System 
  Awaiting Work       358      11257      31.44      0.17 
  Task Fetch          358      11677      32.62      0.18 
  Executing           372     781204    2100.01     11.97 
  Do Fork               0          0                 0.00 
  Do Forall             7      30560    4365.71      0.47 
  Do Traverse           0          0                 0.00 
  Do Join             364      20242      55.61      0.31 
Application 
  Thin Pixel          364    2717114    7464.60     41.62 
  Neighbors          4150     866067     208.69     13.27 
  Pattern            4150    1526713     367.88     23.39 
  Condition C        4150     279607      67.38      4.28 
  Condition D        4150     283728      68.37      4.35 
** Total **                  6528169 


Figure 5. SPoC Execution Statistics Generated by SySM 
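
Deriving a computing profile from recorded state transitions can be sketched as follows. The log format, the function name, and the choice of which states count as idle are assumptions made for illustration; the paper does not describe the host based tools in this detail. 

```python
def computing_profile(transitions, idle_states=("Awaiting Work",)):
    """Build a step function of active processors over time.

    transitions  dict mapping processor id to a time-ordered list of
                 (time_us, state) records, one per state transition
    idle_states  states in which a processor is not doing work
                 (ASSUMPTION: only "Awaiting Work" counts as idle)

    Returns a list of (time_us, active_count) points.
    """
    events = []
    for proc, log in transitions.items():
        for time_us, state in log:
            events.append((time_us, proc, state not in idle_states))
    events.sort(key=lambda event: event[0])

    active = {}
    profile = []
    for time_us, proc, is_active in events:
        active[proc] = is_active
        count = sum(active.values())
        if profile and profile[-1][0] == time_us:
            profile[-1] = (time_us, count)   # merge simultaneous events
        else:
            profile.append((time_us, count))
    return profile


if __name__ == "__main__":
    logs = {   # hypothetical transition logs for two processors
        0: [(0, "Executing"), (50, "Awaiting Work")],
        1: [(0, "Awaiting Work"), (10, "Executing"), (40, "Awaiting Work")],
    }
    print(computing_profile(logs))
```

Intervals in which the count stays at one indicate serialized portions of the application. 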


The DLA hardware allows the SPoC implementors to 
determine the effect of various queuing strategies on the 
global memory access time. The scalability of Simultane- 
ous Pascal applications depends upon the efficiency of the 
runtime software, which can degrade due to excessive 
memory contention. Different queuing strategies (varying 
the number and location of queues), coupled with tech- 
niques which reduce the number of memory accesses (such 
as exponential falloff and interprocessor interrupts) will 
alter memory access patterns and affect contention. The 
DLA provides immediate feedback about changes in con- 
tention and access times, and allows the implementors to 
obtain quantitative results which document the effects of 
each runtime system modification. 


Application programmers can obtain DLA statistics 
from the runtime package, and can use the resulting data to 
determine how object distribution in Concert’s hierarchical 
memory affects application performance. DLA provides 
information about activity within each global memory bank, 
giving insight into how well data objects were distributed 
throughout global memory. Often, parallel algorithms rely 
upon locality of reference to exploit the available paral- 
lelism, and DLA provides insight as to how effectively the 
programmer (and the language tools) have exploited the 
locality available in Concert. 


Figure 6. Computing Profile (top) and Activity Graph (bottom) 
generated by SPoC using SySM. (Axes: Active Processors versus 
Time in microseconds.) 


4.2. Multilisp 


Multilisp is an extended version of the Scheme pro- 
gramming language [20] with explicit parallel constructs. 
The principal such construct is the future. (future x) 
creates a task to evaluate x and returns a placeholder—a 
future—for the result of x. When this result is computed, it 
replaces the placeholder; the future is then said to be deter- 
mined. Meanwhile, the original task may continue execu- 
tion. If a task attempts to perform a strict operation—one 
which requires a value, not a placeholder—on an undeter- 
mined future, the task is suspended and placed on a queue 
of tasks awaiting determination of the value. These tasks 
are activated when the future becomes determined. 
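
The behavior of future and a strict operation can be approximated in Python with the standard concurrent.futures module. This is only an analogy, not the Multilisp implementation: here a strict operation blocks the calling thread rather than suspending a task on a queue, the placeholder object is not replaced by its value, and the helper names future and touch are chosen to mirror the Multilisp forms. 

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def future(thunk):
    """Start evaluating thunk() in another task; return a placeholder."""
    return _pool.submit(thunk)


def touch(x):
    """Strict identity: wait until x is determined, then return its value."""
    while isinstance(x, Future):
        x = x.result()          # blocks until the future is determined
    return x


if __name__ == "__main__":
    placeholder = future(lambda: 6 * 7)
    # The original task continues here while the future is evaluated.
    print(touch(placeholder))
    print(touch(future(lambda: None)))   # analogue of (touch (future nil))
```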


Multilisp programs are compiled to a stack oriented 
machine language called MCODE which is then executed 
by an interpreter written in C and assembly language. Each 
processor runs an identical copy of the interpreter code. 
Further information is available in [21]-[24]. 


The MCODE interpreter running on Concert has 
been instrumented with SySM to determine the overhead 
associated with futures. The simple expression (touch 
(future nil)) was used as a basis for data collection. 
touch is a strict identity operator: it returns the result of 
evaluating its argument. If the result is an undetermined 
future the task is suspended until the future is determined. 
The average time for various future operations computed 
from SySM data collected while evaluating (touch
(future nil)) is shown in Figure 7.


The data collected by DLA for one processor in clus- 
ter five during the execution of a Multilisp application is 
shown in Figure 8. Several aspects of Multilisp behavior 
are revealed. First, the large fraction of accesses to cluster 
five’s global memory indicates that the Multilisp implemen- 
tation possesses a fair degree of locality. The associated 
access times are relatively small because the processors in 
a cluster have direct access to the cluster’s global memory 
via the Multibus. Second, the large number of accesses to
cluster seven is due primarily to MCODE instruction fetch-
es. MCODE instructions are stored in the heap in global
memory. The location of these instructions varies due to
garbage collection activity. Consequently, Multilisp perfor-
mance varies unpredictably, depending on the cluster in
which the instructions reside. A “hot spot” caused by the
large number of accesses to cluster seven leads to the large
access time for the downstream clusters (clusters zero,
one, and two). Third, many of the accesses to cluster zero
(particularly the TASes) are for global Multilisp and Con-
cert system information. Finally, the access time tends to
increase with the number of RingBus segments required for
the access. In this case, the cluster seven hot spot
obscures this trend.


The rates of state transitions for the touch and
ffib examples discussed earlier are about 2500 and 8700
transitions per second, per processor, respectively. This
yields an average time penalty of 3% (for touch) and 7% (for
ffib) due to SySM segment transition commands.
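

A transition rate and a fractional time penalty together imply a cost per transition. The short calculation below is an illustration only; the helper name is hypothetical, and the quoted rates and percentages are rounded, so the results are rough.

```python
def implied_cost_us(transitions_per_sec: float, penalty_fraction: float) -> float:
    """Microseconds per transition implied by a rate and a fractional time penalty."""
    return penalty_fraction / transitions_per_sec * 1e6


touch_cost = implied_cost_us(2500, 0.03)  # roughly 12 microseconds per transition
ffib_cost = implied_cost_us(8700, 0.07)   # roughly 8 microseconds per transition
```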


The SySM and DLA have allowed concrete identifi- 
cation of the time inefficiencies in the Multilisp implementa- 
tion and have been instrumental in speeding the execution 
of Multilisp programs by a factor of two to three. For exam- 
ple, an earlier version of the implementation suffered from 
frequent accesses to global information in cluster zero. A 
significant improvement in performance was achieved by 
minimizing the global information stored in cluster zero. 
The severity of the RingBus contention resulting from the 
original centralization of this information was not realized 
until DLA data was available.


4.3. YARC 


The YARC project encompasses the development of 
a practical static dataflow system[25][26]. Given the 
small number of physical implementations of dataflow archi- 
tectures and the resulting inadequate understanding of their 
behavior, it would be useful to estimate the performance 
and identify the bottlenecks of proposed systems. The Con- 
cert multiprocessor was used to simulate one such system. 


Operation                                   Average Time
create a future object                      1.00 msec
determine a future
    0 tasks queued on future                0.43 msec
    1 task queued on future                 1.1 msec
touch a future
    future determined                       0.24 msec
    future undetermined
        until task enqueued on future       0.86 msec
        total, excluding time to find task  2.6 msec
start new task (once a task found)          1.2 msec


Figure 7. SySM Timings for Various Future Operations


Cluster Number               0       1       2       3       4       5       6       7    SySM
RingBus Distance             2       4       5       4       3     n/a       1       2     n/a
Number Reads              8592    1086     460     342    3206  108833    5070   53729       0
Avg Read Time (usec)       4.9     6.4     6.5     3.5     3.5     0.8     3.4     530     0.0
Number Writes              626     100      59       0       0   48428     537     276   43553
Avg Write Time (usec)      625     6.6     [?]     0.0     0.0     0.9     3.9     6.0     0.8
Number TAS Successes       568      48      25       0       0    5039     253     142       0
Number TAS Failures          8      17       0       0       0     252      27      17       0


Figure 8. Multibus Traffic Generated by a Processor in Cluster Five While Executing (ffib 20)


The purpose of the simulator is twofold: to provide
an accurate emulation of the proposed machine architecture
that is fast enough to run significant programs and, more
importantly, to analyze the dynamics of the proposed archi-
tecture. In order to reveal the characteristics of an architec-
ture, programs of significant size must be run. Without run-
ning sufficiently large programs, the true dynamics of the
system cannot be observed.


Modeling a parallel computer is a job particularly 
well suited to parallel computers. The problem of simulat- 
ing a physical system naturally decomposes into parallel 
subproblems because of its inherently distributed nature. 
The Concert multiprocessor is general enough to support 
simulation of the YARC static dataflow architecture, and 
fast enough to allow large programs to be executed. The 
simulator uses the hardware instrumentation in Concert to 
measure the pertinent parameters of execution and to sepa-
rate the details of the Concert system from those of the
YARC architecture. 


The YARC system simulated on Concert is a token-
based static dataflow system. A collection of template
storage modules/token processors is associated with an
arithmetic function unit to form a processing ensemble. Pro-
cessing ensembles are connected by a routing unit in a ring
or toroidal configuration.


This target architecture is mapped directly onto Con- 
cert. The similarities between the organization of Concert 
and YARC permit each processing ensemble to be mapped 
onto a Concert cluster. The inter-cluster communication 
directly reflects the communication between the processing 
ensembles in the YARC system. In a similar manner, the 
individual template stores are mapped onto Concert proces- 
sors. All template storage module communication takes 
place in Concert global memory. 


This direct mapping permits the execution of the 
simulator on Concert to approximate the actions of the 
YARC architecture. SySM and DLA are used to account 
for (and factor out) the anomalous effects of the Concert 
architecture as well as monitor the execution of YARC. 
This is a crucial aspect of the simulator system. The static 
mapping of simulated entities to specific processors permits 
the instrumentation, which is processor-oriented, to be 
brought directly to bear. 


As an example of the usefulness of the SySM data, see
Figure 9. Many of the entries are an exact enumeration of
YARC’s activity. This data, along with performance mod-
els of various potential physical YARC implementations,
can be used to predict the performance of those systems.


Actions such as queueing or dequeueing are made
necessary only by the structure of the simulator. The time
they take on Concert is unimportant to the extent that it does
not affect the rest of the target system by serializing opera- 
tions. The average time for each entry represents the token 
activity in YARC, and may be used to better understand or 
model its activity. Any variation from the average repre- 
sents contention for global memory, which provides an 
approximation of the contention YARC would experience. 
While the time data produced by SySM is not directly use- 
ful, the variation among different executions represents dif- 
fering utilization of the target system. 


A substantial portion of the total token traffic is 
acknowledgments (ACK tokens) to the data tokens, as 
shown in Figure 9. Acknowledgment tokens constitute 
synchronization overhead in the static dataflow model. 
There are techniques to reduce such overhead at the 
expense of reducing parallelism. 


The lack of parallelism in this problem, due to the 
statically distributed nature of the model, is indicated by the 
large value of the Queue wait entry. This processor spent 
about 33% of its time waiting for incoming data. In contrast, 
the processor that contained the critical path waited only 
twice, and other processors in the system spent up to 63% 
of their time waiting. This imbalance shows that the simple 
allocation scheme used needs much improvement. 


Processor 62: (all times are in microseconds)


State              Entered      Time    Average   Percent
INITIALIZE               1      1956                 0.05
Total tokens          4128    231562      57.55      6.27
Memory tokens            0         0                 0.00
Data tokens           1744     58314      33.44      1.54
ACK tokens            2384     39254      16.47      1.04
Complete check        4128    367502      89.03      9.70
Template firings      1288    151090     117.31      3.99
Inspect Token         4528    253876      56.07      6.70
Reset template        1288    124893      96.97      3.30
Tokens enqueued       4144    151223      36.49      3.99
Queue wait             294   1253528    4263.70     33.09
Dequeue token         4128    482214     116.82     12.73
Local enqueue            0         0                 0.00
Foreign enqueue       4144    624905     150.80     16.49
Termination chk        472     42232      89.47      1.11
Termination inc          0         0                 0.00
** Total **                  3788549


Figure 9. YARC Simulation Data Obtained by SySM 
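

The Average and Percent columns of Figure 9 appear to be derived from the Entered and Time columns: Average is Time divided by Entered, and Percent is Time as a share of the Total row. The sketch below checks that reading against several rows of the figure; the function name summarize is hypothetical, and the rounding to two decimals is an assumption about how the table was printed.

```python
TOTAL_US = 3788549  # "Total" row of Figure 9, in microseconds

# (state, times entered, total time in microseconds) for a few rows of Figure 9
rows = [
    ("ACK tokens", 2384, 39254),
    ("Complete check", 4128, 367502),
    ("Queue wait", 294, 1253528),
    ("Dequeue token", 4128, 482214),
    ("Foreign enqueue", 4144, 624905),
]


def summarize(entered: int, time_us: int, total_us: int = TOTAL_US):
    """Return (average time per entry, percent of total time), rounded to 2 places."""
    average = round(time_us / entered, 2)
    percent = round(100.0 * time_us / total_us, 2)
    return average, percent


for state, entered, time_us in rows:
    average, percent = summarize(entered, time_us)
    print(f"{state:<16} {entered:>6} {time_us:>8} {average:>8.2f} {percent:>6.2f}")
```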


5. Conclusions 


Special purpose instrumentation can be an effective
tool for evaluating the performance of a multiprocessor.
Two devices were developed and used in a 64 processor
shared memory multiprocessor. DLA measures the delay
processors experience contending for shared physical
resources and the losses attributed to memory access
latency. SySM provides execution profiling statistics with a
minimum of intrusion. The functionality and operation of
both instruments were described. Examples of their appli-
cation to parallel processing research were presented,
along with data reflecting performance losses within each
application. The SySM and DLA provide a mechanism for
observing detailed characteristics of system operation, pro-
viding quantitative feedback to support systematic
research.


SySM and DLA can be used to provide performance
data in a variety of parallel applications. The YARC emula-
tor uses the highest level tool, SySM, as an intrinsic and
explicitly coded part of the simulator, but does not use the
low-level measurement capabilities of DLA. SPoC uses
both tools as part of the run-time system, and provides
optional user coded SySM states. Multilisp uses SySM to
measure small but important sections of the run-time sys-
tem and, in contrast to YARC, makes heavy use of DLA to
estimate communication and locality.


There are several advantages in the current SySM 
and DLA hardware. The small amount of intrusion in SySM, 
and lack thereof in DLA, minimize the impact of measure- 
ment on application performance. In addition, the small 
intrusion allows accurate, fine-grained measurements to be 
made. SySM and DLA operate in real time; the data they 
derive represent real times, not synthetic values produced 
from simulation. The ability to integrate both devices into 
significant parallel applications with a minimum of overhead 
is an important feature for systems developers. 


The use of SySM and DLA has revealed limitations 
that should be considered in the design of more advanced 
instrumentation. SySM and DLA were developed sepa- 
rately; each performs its monitoring functions independently 
of the other. An unfortunate consequence is that data from 
each SySM segment cannot be directly correlated with con- 
tention and latency losses derived from DLA. Integration 
of the two systems would permit easy determination of 
hardware performance losses within each segment. Cur- 
rently, this can only be estimated from the averages sup- 
plied by DLA or by starting and stopping DLA measure- 
ments on the boundaries of a specific segment. 


While SySM is only slightly intrusive, there is a 
lower bound on the effective segment length. Each seg- 
ment transition requires a single instruction cycle, obscur- 
ing the measurement of short instruction sequences. This 
problem can be alleviated by providing an associative buffer 
that stores the segment transition addresses. The buffer 
would monitor the address bus, looking for matches. When
a match occurs, the registers of the relevant segment would
be updated. This mechanism performs the same function as 
the current SySM but in a totally nonintrusive manner. 


The associative buffer technique could also be used 
to control experiment interval windows. DLA and SySM 
could be turned on and off at selected points during program 
execution to acquire measurements of just part of the pro- 
gram. The size of the time interval that can be effectively 
windowed is currently constrained by the intrusive nature of 
the start and stop commands that must be provided explicit- 
ly in the code. The use of an associative buffer would elimi- 
nate this source of intrusion, permitting fine grained window 
intervals for experiments. 


Parallel programs that are dynamically scheduled by 
underlying run time software do not reside on any one pro- 
cessor but migrate throughout the system. SySM measure- 
ments are processor oriented, unable to track specific tasks 
as they move among processors. This processor orienta- 
tion can make it difficult to study dynamically scheduled 
applications. More sophisticated instrumentation is needed 
to support this type of analysis. SySM and DLA compute 
the average execution time of a specific event, but do not 
provide information about events of variable duration. Two 
solutions to this problem have been implemented in other 
experimental instruments. The System Activity Moni-
tor [27] calculates the sum of squares in real time, yielding
the variance of the parameters being measured. The Spec-
tron [28] generates histograms of the parameters, rather
than averages.
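

Both alternatives can be illustrated in software. The sketch below is not a description of either cited instrument; RunningStats and histogram are hypothetical names, and the code only shows why accumulating a sum of squares alongside the sum is enough to recover a variance, and how fixed-width bins retain the shape of a distribution that an average hides.

```python
class RunningStats:
    """Accumulates a count, a sum, and a sum of squares, one sample at a time."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        return self.total / self.n

    def variance(self) -> float:
        """Population variance: E[x^2] - (E[x])^2."""
        m = self.mean()
        return self.total_sq / self.n - m * m


def histogram(samples, bin_width: float) -> dict:
    """Count samples per bin; bin b covers [b * bin_width, (b + 1) * bin_width)."""
    counts = {}
    for x in samples:
        b = int(x // bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts


durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up event durations
stats = RunningStats()
for d in durations:
    stats.add(d)
bins = histogram(durations, 2.0)
```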


Although SySM can be used to acquire traces of pro- 
cessor activity versus time, the cost in processor overhead 
and memory utilization is prohibitive. Future instrumenta- 
tion hardware should migrate this functionality from the pro- 
cessor into the monitoring hardware. In conjunction with 
this problem, synchronizing multiple SySM time bases in 
different clusters is difficult. Future versions of SySM 
should be less cluster chauvinistic, and support system 
wide communication. 


Finally, performance losses are not the only way to 
characterize system behavior. While loss measurement 
focuses on the temporal resources of a parallel computing 
system, alternate measurements could examine system 
resources such as memory usage. This can be particularly 
important when studying cache demands in multitasking 
and virtual memory systems, areas in which the described 
work does not readily apply.


ABSTRACT


We consider the parallel solution of sparse systems of linear 
equations. In such systems, parallelism and communication pat- 
terns are dependent on the nature of sparsity in the input matrix 
system. A new algorithm, Block Solve, in which processors 
access blocks of rows from shared memory, is described. Experi- 
ments were carried out on the eight-processor Alliant FX/8 to 
determine the effectiveness of various blocking strategies in reduc- 
ing execution times. An average block size of between four and 
eight minimized execution times. The Alliant FX/8 was used to 
emulate the execution of Block Solve on a shared memory mul- 
tiprocessor with private memories. 


1. INTRODUCTION 


In shared-memory multiprocessor architectures, communica- 
tion and synchronization overhead can significantly affect perform- 
ance. Synchronization overhead is incurred when serialized access 
to shared variables must be enforced, thus resulting in contention
on shared lock variables. In shared-memory multiprocessors 
where each processor has a local memory or private cache[1], 
access to variables in the shared global memory is slower than 
accesses to variables in the private cache. For such multiprocessor 
systems, the communication delay is the difference between the 
delay in a write followed by a read on a shared variable from 
shared memory and the delay in a write followed by a read on a 
private variable from local memory. In many algorithms and 
problems, increasing the task granularity decreases the communi- 
cation and synchronization overhead. However, allocation of large 
tasks to processors increases the likelihood of some processors 
being forced to idle, i.e., worsens load balancing. In problems
and algorithms where communication and parallelism are rela-
tively independent of the input data, e.g., the FFT algorithm,
analytic models and deterministic schedules are applicable. However,
such models are not usually applicable where the communication
patterns are not uniform and regular.


In this paper, we consider the solution of sparse systems of
linear equations, in which parallelism and communication patterns 
are dependent on the nature of sparsity in the input matrix system. 
An asynchronous parallel algorithm is developed to solve the 
matrix system. Dynamic techniques are used in this algorithm to 
estimate the current parallelism, i.e., the number of row operations 
that can be performed simultaneously. The estimate of the current 
parallelism is used to continuously balance parallelism and com- 
munication requirements. When a large amount of parallelism is 
available, individual processors are assigned large tasks. How- 
ever, when the parallelism decreases, for instance, towards the end 
of the computation, the task granularity is decreased to improve 
the utilization of the processors. 


A new algorithm, Block Solve, in which processors access 
blocks of rows from shared memory, is described. This algorithm 
is a generalization of Gaussian Elimination with pairwise pivoting, 
in which processors only access pairs of rows from shared 
memory. In the Block Solve algorithm, processors access blocks
of rows, where the block size (the number of rows brought into
local memory) need not necessarily be two as in pairwise pivoting.
Various blocking strategies that control the block size, i.e., the 
number of rows accessed by each processor on each access to 
shared memory, are described. Blocking is an attempt to 
parameterize the behavior of the linear system solver with respect 
to interprocessor communication. The right choice of block size 
balances communication requirements and parallelism and optim- 
izes performance. 


Experiments were carried out on the eight-processor Alliant 
FX/8 to determine the effectiveness of these blocking strategies in 
reducing execution times for various linear systems. The execu- 
tion of the Block Solve algorithm on a multiprocessor system with 
private caches for each processor and a slower shared global 
memory is emulated on the Alliant FX/8. The interrelationship 
between blocking and global communication delay was examined 
by measuring performance for different types of matrix systems 
using a range of block sizes and global delays. 


Increasing the block size reduces communication and syn- 
chronization overhead and thereby reduces completion times. 
However, a large block size increases idle time of processors. 
Measurements indicate that a moderate block size of between four
and eight balances these conflicting requirements and minimizes
execution times on the Alliant FX/8 for a range of matrix systems.
In a multiprocessor system with a shared memory and local
memories for each processor, the effect on performance of large
shared global memory delays relative to local memory delays is
reduced if appropriate block sizes are chosen through the blocking
strategies presented in this paper.


The number of processors in shared-memory multiprocessor 
systems continues to increase. In future systems, the shared glo- 
bal memory will be much slower than the private local memories 
of individual processors. The algorithm described here attempts to 
reduce the number of accesses to the relatively slow global 
memory and therefore is likely to be superior to existing direct 
solvers for such multiprocessor systems. Furthermore, the tech-
nique of modifying task granularity based on a current parallelism
estimate may be applicable to other important numerical algo-
rithms.


In Section 1, the problem domain and the algorithm that was 
implemented are described. In Section 2, the tradeoff between 
synchronization and computation is examined. In Section 3, the 
results obtained by varying the block size in a parallel linear sys-
tem solver running on the Alliant FX/8 are presented. In Section
4, the Alliant FX/8 is used to emulate a machine with private
memories and a single shared global memory. In each of the
above two sections, details of the machine and the algorithm are 
followed by the results of experiments performed by running the 
algorithm. 


2. THE SOLUTION OF SPARSE LINEAR SYSTEMS 


The problem domain is the solution of a sparse system of N
simultaneous linear equations, represented as Ax = b, where A
is a (possibly unsymmetric) sparse N x N coefficient matrix, and x
and b are N-vectors. A and b are known and it is necessary to
determine the N-vector x. In a general sparse system, there are
relatively few nonzero elements in A, but the distribution of the
nonzero elements does not fall into any regular pattern. A large
number of computationally expensive scientific and engineering
applications, e.g., structural analysis, fluid dynamics, aerodynam-
ics, computer-aided design, and circuit simulation, are based on
the solution of large sparse systems of linear equations [2]. It is
therefore important to develop good parallel algorithms for solving
sparse linear systems.


Fig. 1. Pairwise reduction and Givens’ reduction


LU decomposition is a direct method for solving linear sys- 
tems [3]. It involves a forward reduction phase that obtains lower 
and upper triangular matrices, L and U, where A = LU, and a
back substitution phase to get the solution vector x. Only per-
formance of the forward reduction phase is analyzed in this paper
since it is computationally more expensive than the back substitu-
tion phase. Parallel algorithms for solving Ax = b when A is
dense (i.e., when most coefficients are non-zero) employ schedules 
where the actions of each of the processors are predetermined 
before run time [4]. These algorithms are not efficient for general 
sparse systems. 


2.1. Pairwise Solve 


Pairwise Solve, or PSolve, is an asynchronous, nondeter- 
ministic, parallel algorithm based on pairwise pivoting [5,6]. Con- 
sider two rows of the A matrix whose leading (or leftmost) 
nonzero elements lie in the same column. If one row (called the 
pivot row) is multiplied by an appropriate factor and added to the 
second row, the leading nonzero of the second row can be reduced 
to zero, thus simplifying the equation corresponding to the reduced 
row. In pairwise pivoting then, elementary 2 x 2 stabilized elimi- 
nators S are constructed and the pair of rows is premultiplied by S 
to create a zero (Figure 1)[7]. The column index of the leading 
nonzero element in a row is referred to as the column index of the 
row. 


The algorithm uses the data structures shown in Figure 2 to
detect parallelism efficiently. A column list, $col_j$, associated with
each column $j$, is a list of rows that have a column index of $j$.
Rows with fewer nonzeros are kept to the front of the column lists
to reduce fill-in, i.e., the extent to which zero-valued elements in
the original A matrix are converted to nonzero elements by the
reduction process. Only the nonzero values of the rows are actu-
ally stored and operated on.


Fig. 2. Sparse matrix data structure


To find work, a processor scans the column lists for a list 
with two or more rows. It removes the first two rows in the list, 
reads the pivot row and returns it to the original list unchanged, 
and reduces the second row, thereby increasing its column index 
by at least one. The second row is put in a new column list 
corresponding to its new column index. The algorithm completes 
when the A matrix is upper triangular and each column list con- 
tains exactly one row. The advantages of PSolve have been 
demonstrated by measuring execution times on the Alliant FX/8 
for 38 test matrices from the Harwell/Boeing sparse matrix collec- 
tion [6]. 
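

The work loop described above can be sketched sequentially in Python. This is an illustration under several assumptions, not the PSolve implementation: rows are stored as (coefficient dictionary, right-hand side) pairs, the first row of a list always serves as the pivot (the stabilized choice of pivot is omitted), a single thread stands in for the asynchronous processors, and all function names and the tolerance EPS are hypothetical. The back substitution is included only to check the result.

```python
from collections import defaultdict

EPS = 1e-12  # assumed tolerance below which an entry is treated as zero


def column_index(coeffs):
    """Column of the leading (leftmost) nonzero of a sparse row."""
    return min(coeffs)


def reduce_row(pivot, row):
    """Eliminate the leading nonzero of `row` using `pivot` (same column index)."""
    (pc, pb), (rc, rb) = pivot, row
    j = column_index(pc)
    factor = rc[j] / pc[j]
    new = {}
    for c in set(pc) | set(rc):
        v = rc.get(c, 0.0) - factor * pc.get(c, 0.0)
        if c != j and abs(v) > EPS:
            new[c] = v  # fill-in occurs when c was absent from the reduced row
    return new, rb - factor * pb


def psolve_forward(rows):
    """Reduce until every nonempty column list holds exactly one row."""
    col_lists = defaultdict(list)
    for r in rows:
        col_lists[column_index(r[0])].append(r)
    work = True
    while work:
        work = False
        for j in sorted(col_lists):
            lst = col_lists[j]
            lst.sort(key=lambda r: len(r[0]))  # rows with fewer nonzeros first
            while len(lst) >= 2:
                pivot, second = lst[0], lst.pop(1)  # the pivot stays in its list
                reduced = reduce_row(pivot, second)
                if reduced[0]:  # an empty row (singular system) is dropped here
                    col_lists[column_index(reduced[0])].append(reduced)
                work = True
    return {j: lst[0] for j, lst in col_lists.items() if lst}


def back_substitute(upper, n):
    x = [0.0] * n
    for j in reversed(range(n)):
        coeffs, rhs = upper[j]
        s = sum(v * x[c] for c, v in coeffs.items() if c != j)
        x[j] = (rhs - s) / coeffs[j]
    return x


# Small made-up system whose solution is x = (1, 1, 1).
rows = [({0: 2.0, 1: 1.0}, 3.0),
        ({0: 4.0, 1: 3.0, 2: 1.0}, 8.0),
        ({1: 1.0, 2: 2.0}, 3.0)]
upper = psolve_forward(rows)
x = back_substitute(upper, 3)
```

Each reduction moves a row to a column list with a larger index, which is why the loop terminates.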


2.2. Characterization of parallelism in PSolve


The amount of parallelism, i.e., the number of row reduc-
tions that can be executed simultaneously in this context, is a
function of the input matrix structure, specifically the pattern of
zero and nonzero elements. Parallelism also varies during the exe-
cution of the algorithm. The amount of parallelism determines the
number of processors that can be used effectively; the follow-
ing attempts to characterize the parallelism in Pairwise Solve.


Let $n_j$ be the number of rows in column list $j$. Let $N_1$ and
$N_0$ be the number of column lists with $n_j$ equal to one and zero,
respectively. Assume that, during each row reduction, only the
row being reduced is locked for exclusive access by a processor
and a pivot row is simply read and released. An upper bound on
the maximum number of simultaneous row reductions and hence
the currently available parallelism is $N - N_1 - 1$, since the $N_1$
rows belonging to the $N_1$ column lists with $n_j = 1$ cannot take part
in row reductions.


Lemma 1. The pairwise algorithm terminates within $N(N-1)/2$
reduction steps, if the $N \times N$ matrix, A, is nonsingular.


Proof: Each reduction step creates a new zero below the diagonal. Such
zeros are not converted back to nonzero elements. Since the ini-
tial maximum number of nonzero elements below the diagonal is
$N(N-1)/2$, the pairwise algorithm terminates within $N(N-1)/2$
reduction steps. □

Lemma 2. The currently available parallelism is $N_0$.


Proof: Column lists, $j$, with $n_j \ge 2$ contain rows on which further
reductions can be currently performed. In such column lists, $n_j - 1$
rows can be simultaneously reduced by $n_j - 1$ processors using the
remaining $n_j$th row. Thus, the parallelism for Pairwise Solve, $P_2$,
is

$$P_2 = \sum_{j:\, n_j \ge 2} (n_j - 1). \qquad (1)$$

Since each of the $N$ rows is associated with exactly one column
list, $\sum_{j=1}^{N} n_j = N$. Expanding (1) gives

$$P_2 = \sum_{j:\, n_j \ge 2} n_j - \sum_{j:\, n_j \ge 2} 1 = (N - N_1) - (N - N_0 - N_1) = N_0.$$

Thus, the parallelism, $P_2$, is $N_0$, the number of empty column
lists. □
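

The counting argument in the proof can be checked numerically. The sketch below assumes, as the proof does, that there are as many column lists as rows and that each list with at least two rows contributes one fewer reduction than it has rows; the function names and the sample list sizes are made up for illustration.

```python
def available_parallelism(list_sizes):
    """Sum of (n_j - 1) over the column lists with n_j >= 2."""
    return sum(n - 1 for n in list_sizes if n >= 2)


def count_lists_of_size(list_sizes, size):
    return sum(1 for n in list_sizes if n == size)


# n_j for N = 8 column lists; the sizes sum to N because every row is in one list.
sizes = [3, 0, 1, 2, 0, 0, 1, 1]
n_rows = len(sizes)
p2 = available_parallelism(sizes)        # 2 + 1 = 3
n0 = count_lists_of_size(sizes, 0)       # three empty lists
n1 = count_lists_of_size(sizes, 1)       # three singleton lists
upper_bound = n_rows - n1 - 1            # bound stated before Lemma 1
```

For these sizes the parallelism equals the number of empty lists and stays within the upper bound.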


Lemma 3. The currently available parallelism, $P_2$, is a monotoni-
cally nonincreasing function of time during the execution of the
algorithm.


Proof: No row reduction can eliminate the last row from a
column list. Hence, once a column list, $j$, has $n_j \ge 1$, it will con-
tinue to have $n_j > 0$. Therefore, $N_0$ and hence $P_2$ are monotoni-
cally nonincreasing. □


2.3. Block Solve 


Block Solve is a new algorithm developed in this paper to 
explore blocking. It is based on the Pairwise Solve algorithm. In 
Block Solve, a processor accesses a block of k rows from a set of 
contiguous column lists. The row containing the leading element 
that has the largest absolute value among the rows with the least 
column index is not reduced. All other rows may be reduced and 
are locked for exclusive access. Pairwise row reduction steps are 
performed on rows within the block until no two rows share the 
same column index. At this stage, no further row reductions can 
be performed locally within the block. The processor releases the 
block of rows and accesses a fresh set of rows. This is a generali- 
zation of Pairwise Solve in the sense that Block Solve with a 
block size of two is identical to Pairwise Solve, where block size, 
k, is the number of rows accessed. The available parallelism with
a block size of k, $P_k$, is the number of size k blocks that can be
reduced simultaneously. Since each processor reserves $(k-1)$
rows for exclusive access, the maximum parallelism in Block
Solve is $(N-1)/(k-1)$, as compared to $N-1$ in Pairwise Solve.
As the block size, k, increases, the parallelism decreases. If the
number of processors, p, is constant, some processors must idle
when $P_k$ falls below p.


Despite this drawback, a block size, k, greater than two 
results in better performance when interprocessor communication 
is expensive. In a multiprocessor system with private caches and 
a shared global memory, a block size of k>2 reduces the com- 
munication time of a processor, i.e., the time spent accessing data 
from global memory. In the ideal case, the communication time is 
reduced by a factor of (k—1)/2 as indicated in the following. 


A row $i$ is k-solid if the $k-1$ elements following the lead-
ing nonzero element are also nonzero, i.e., $\{a_{i,1}, \ldots, a_{i,j-1}\}$ are
zero and $\{a_{i,j}, \ldots, a_{i,j+k-1}\}$ are nonzero. A block of k rows is
solid if all rows are k-solid and have the same column index.
Since $k(k-1)/2$ elements in the block must be reduced to zero
before releasing the block and since each row reduction step
reduces one element to zero, a processor performs $k(k-1)/2$
reductions after accessing a solid block of k rows. Therefore, the
compute-to-communication ratio is $(k-1)/2$ in units of “row
reductions per globally accessed row.” In rare instances, a row
reduction step may reduce two or more elements to zero, and con-
sequently, a processor may perform fewer than $k(k-1)/2$ reduc-
tions. In such cases, the compute-to-communication ratio
approaches, but is less than, $(k-1)/2$.
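

The tradeoff in the ideal (solid block) case can be tabulated directly from the two expressions in the text. The helper names below are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def reductions_in_solid_block(k: int) -> int:
    """Row reductions performed on a solid block of k rows: k(k-1)/2."""
    return k * (k - 1) // 2


def compute_to_communication(k: int) -> Fraction:
    """Row reductions per globally accessed row for a solid block: (k-1)/2."""
    return Fraction(reductions_in_solid_block(k), k)


def max_parallelism(n_rows: int, k: int) -> Fraction:
    """Maximum parallelism with block size k: (N-1)/(k-1)."""
    return Fraction(n_rows - 1, k - 1)


for k in (2, 4, 8):
    print(k, reductions_in_solid_block(k), compute_to_communication(k),
          max_parallelism(1001, k))
```

Doubling the block size from four to eight raises the ideal ratio from 3/2 to 7/2 while the maximum parallelism falls by a factor of 7/3.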


A solid block represents the ideal case. In general, a proces-
sor may not find k k-solid rows with the same column index and,
in this case, the compute-to-communication ratio is less than
$(k-1)/2$. Each processor scans successive columns for a column
list, $j$, with $n_j \ge 2$. The first set of rows is dequeued from this
column list. This criterion is used because if the first row is from
a column list with $n_j = 1$, then this row cannot be used in any row
reduction step. A processor then scans additional columns, if
needed, until it acquires a total of k rows. Single rows may be
acquired from succeeding column lists, i.e., rows from column
lists $j$ with $n_j = 1$, because row reductions can usually be per-
formed on such rows. For instance, if two rows are acquired from
column 5 and a single row is acquired from column 6, a row 
reduction involving the two rows from column 5 usually results in 
a row with a leading nonzero element in column 6. A row reduc- 
tion step is then performed using the row from column 6 and the 
reduced row obtained from the first row reduction. 


The following early-quit criterion may terminate a scan 
before k rows are acquired. A scan is obviously terminated once 
column N has been scanned. The scan is also abandoned when 
the difference between the current column and the column from 
which the first row was acquired is greater than the number of 
rows already acquired. This early-quit criterion abandons a scan if 
it is expected that acquiring an additional row will not reduce 
communication time. If this condition holds, rows from the 
current column list cannot be reduced using the rows already 
acquired. Therefore, accessing additional rows reduces parallelism 
but does not reduce communication time. For instance, if a pro- 
cessor acquires the first two rows from column five, reserves a 
total of four rows, and is scanning column nine, then the scan is 
ended and the processor attempts to reduce the four rows 
acquired. Acquiring additional rows from column nine does not 
reduce communication, since the four rows acquired cannot be 
used to reduce rows acquired from column nine. However, 
acquiring additional rows would reduce the number of rows avail- 
able to the remaining processors. 
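

The scan and its early-quit criterion can be sketched as follows. Several points are assumptions rather than details given in the text: the function name acquire_block is hypothetical, a processor is assumed to take as many rows as it still needs from each list, and the comparison is taken as "greater than or equal" so that the worked example (first rows from column five, four rows held, scan ended at column nine) is reproduced.

```python
def acquire_block(sizes, k):
    """Choose rows for one block.

    sizes[j] is the number of rows in column list j for j = 1..N (sizes[0] is
    unused). Returns a dict mapping column -> number of rows taken from it.
    """
    n_cols = len(sizes) - 1
    taken = {}
    first = None      # column from which the first rows were acquired
    acquired = 0
    for j in range(1, n_cols + 1):
        if first is None:
            if sizes[j] < 2:
                continue          # the first rows must come from a list with n_j >= 2
            first = j
        elif j - first >= acquired:
            break                 # early quit: the held rows cannot reach column j
        take = min(sizes[j], k - acquired)
        if take:
            taken[j] = take
            acquired += take
        if acquired == k:
            break                 # the block is full
    return taken


# Made-up list sizes for N = 10 columns; column 5 is the first with two rows.
sizes = [0, 1, 1, 0, 1, 2, 1, 1, 0, 3, 1]
block = acquire_block(sizes, 6)   # stops at column nine holding four rows
```

With a block size of three, the same scan fills the block after columns five and six and never reaches the early-quit test.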


2.4. Givens’ reduction 


Another basic difference between the Pairwise Solve algo- 
rithm [5] and the Block Solve algorithm presented here is in the 
strategy used to create zeros to transform the matrix A into upper 
triangular form. While pairwise pivoting was used in PSolve, 
Givens’ reduction is used here. In Givens’ reduction, the premul- 
tiplication is performed by the 2 x 2 matrix shown in Figure 1. 
The computation required in Givens’ reduction to create a zero is 
twice that in pairwise pivoting. Furthermore, in a parallel imple- 
mentation, both rows involved in a row reduction step must be 
locked, as both rows are modified. Since the earlier results on 
parallelism assumed that only one of the rows is locked, those 
results must be modified appropriately. The major advantage of 
Givens’ reduction is its increased numerical stability. 
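
A single reduction step can be illustrated as follows. Figure 1 is not reproduced here, so this sketch assumes the standard Givens rotation with c = a/r and s = b/r, where a and b are the two leading elements and r = sqrt(a^2 + b^2); the function name and the dense list representation are illustrative choices only. Note that both rows are rewritten, which is why both must be locked.

```python
import math


def givens_reduce(row_a, row_b, j):
    """Rotate two rows so that element j of the second row becomes zero.

    Both rows are modified, unlike elimination with a single pivot row.
    """
    a, b = row_a[j], row_b[j]
    r = math.hypot(a, b)
    if r == 0.0:                       # nothing to eliminate
        return list(row_a), list(row_b)
    c, s = a / r, b / r
    new_a = [c * x + s * y for x, y in zip(row_a, row_b)]
    new_b = [-s * x + c * y for x, y in zip(row_a, row_b)]
    new_b[j] = 0.0                     # remove any rounding residue
    return new_a, new_b


top, bottom = givens_reduce([3.0, 1.0], [4.0, 2.0], 0)
print(top, bottom)
```

Each updated element costs two multiplications and one addition in each of the two rows, in line with the statement that the work is about twice that of pairwise pivoting.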


2.5. Matrix systems 


In this paper, results obtained on two types of sparse
matrices are presented. Synthetic sparse matrices are parameterized
by the size of the matrix and by width, ewidth and scatter. An
element, a_ij, is nonzero if |i-j| <= width and has a probability,
scatter, of being nonzero if width < |i-j| <= ewidth; it is always
zero if |i-j| > ewidth. Two types of synthetic sparse matrices
were chosen, with the parameters (width, ewidth, scatter) set
to (3, 60, 0.01) and (10, 100, 0.01), respectively. Six real
sparse matrices were chosen from the Harwell/Boeing sparse
matrix collection [8].
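
The synthetic nonzero pattern can be generated along the following lines. This is a minimal sketch, not the generator used for the experiments: the function name, the seed argument, and the treatment of the boundary cases |i-j| = width and |i-j| = ewidth (both counted as inside their bands) are assumptions.

```python
import random


def synthetic_pattern(n, width, ewidth, scatter, seed=0):
    """Return, for each row i, the sorted column indices of its nonzeros."""
    rng = random.Random(seed)
    pattern = []
    for i in range(n):
        cols = []
        for j in range(n):
            d = abs(i - j)
            if d <= width:
                cols.append(j)                      # inner band: always nonzero
            elif d <= ewidth and rng.random() < scatter:
                cols.append(j)                      # outer band: nonzero with prob. scatter
        pattern.append(cols)
    return pattern


rows = synthetic_pattern(200, width=3, ewidth=60, scatter=0.01)
print(sum(len(r) for r in rows), "nonzeros in a 200 x 200 pattern")
```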


3. BLOCKING ON THE ALLIANT FX/8 


Execution times were determined on the Alliant FX/8 for a 
range of matrices and block sizes. The Alliant FX/8 is a shared- 
memory multiprocessor in which all eight processors share a com- 
mon cache. Therefore, access to shared and nonshared variables 
takes the same time, and communication delay is zero. However,
a large block size reduces the number of mutually exclusive
accesses to column lists. As a result, a large block size reduces 
the degree of contention for shared variables and consequently the 
synchronization time. 


The PSolve algorithm was observed to have a speedup of 
between five and seven on an eight-processor system for a wide 
range of matrices from the Harwell/Boeing collection. The Block 
Solve algorithm improves on this speedup by varying block size. 
Since the PSolve algorithm achieves appreciable speedup, the 
improvement that can be obtained using Block Solve on an eight- 
processor system is limited. However, in larger multiprocessor 
systems, the degree of contention and hence the synchronization 
overhead increases. In such systems, the blocking techniques in 
Block Solve that reduce synchronization overhead are more 
effective. 


3.1. Constant and variable blocking 


The maximum block size is the maximum number of rows 
that a processor can acquire on each access. If one of the early- 
quit criteria is applicable, a processor may terminate a block 
access and perform a block reduction on a block size smaller than 
the maximum block size. In constant blocking, the maximum 
block size is fixed throughout program execution. In variable 
blocking, the maximum block size is based on an estimate of the 
currently available parallelism. Variable blocking is superior 
because parallelism decreases during program execution (Lemma 
3) and the choice of block size should attempt to rebalance com- 
munication and parallelism continuously in order to optimize per- 
formance. 


Two strategies for obtaining an estimate of the available
parallelism are described in the following. In the C_1 method, a
count, c_1, of the number of elements in the set, C_1, of consecutive
column lists {1, ..., c_1} with n_j=1 is maintained. Since no reductions
can be performed on the set of rows with column indices in
the set C_1, (N-c_1) is a (rather optimistic) estimate of the available
parallelism. In the N_z method, a count, N_z, of the number of
columns with n_j=0 is maintained. The implementation of this
method uses the fact that once n_j has a nonzero value it will never
be zero again. Before program execution, N_z is initialized to its
correct value. When a reduced row is returned to shared storage
by a processor, N_z is decremented if no other row has the same
column index.


The N_z method of estimating parallelism is more accurate
than the C_1 method. For instance, consider the solution of a tridiagonal
system using the Block Solve algorithm. In a tridiagonal
matrix, element a_ij is nonzero if and only if |i-j| <= 1. Exactly
one row reduction can be performed at each step and the parallelism
is one. However, the amount of parallelism indicated by the
first method is N-c_1, where c_1 is initialized to zero and is incremented
after each reduction. The N_z method accurately indicates
a parallelism of exactly one throughout the reduction phase, since
only one column, column N, has n_j=0. When the last reduced
row is returned, n_N=1, the available parallelism is zero and the
reduction phase is complete. However, the C_1 method is
sufficiently accurate for dense matrices. Furthermore, the count,
c_1, is maintained by one processor. In contrast, since the count,
N_z, is decremented by all processors, contention on the
corresponding variable can degrade performance in a large system
with several processors.
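
The tridiagonal example can be checked numerically. The sketch below computes both estimates from the per-column row counts; the function names and the 0-based list holding the counts for columns 1 through N are illustrative assumptions.

```python
def parallelism_estimates(counts):
    """counts[j-1] is the number of rows whose leading nonzero is in column j.

    Returns (N - c1, Nz): the estimate from the count of leading columns
    holding exactly one row, and the count of columns holding no row.
    """
    n = len(counts)
    c1 = 0
    for rows_in_col in counts:         # length of the leading run of ones
        if rows_in_col != 1:
            break
        c1 += 1
    nz = sum(1 for rows_in_col in counts if rows_in_col == 0)
    return n - c1, nz


def tridiagonal_counts(n):
    """Initial counts for a tridiagonal matrix with columns 1..n."""
    counts = [0] * n
    for i in range(1, n + 1):
        lead = max(1, i - 1)           # row i starts in column i-1 (row 1 in column 1)
        counts[lead - 1] += 1
    return counts


print(parallelism_estimates(tridiagonal_counts(10)))
```

For N = 10 the first estimate starts at 10 while the second starts at 1, which is the actual parallelism of this system.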


In a variable blocking scheme, an estimate of the parallelism
is multiplied by some factor in order to obtain the current maximum
block size. Since the C_1 method overestimates parallelism,
this factor compensates in part for the overestimation. Even in the
accurate N_z method, performance may be improved by varying the
multiplicative factor. Two methods of specifying the multiplicative
factor are outlined. The first specifies the maximum block
size, k_max, when there is full parallelism. Thus, the current maximum
block size is chosen to be max(2, k_max(N-c_1)/N) or
max(2, k_max(N_z/N)). The choice of k_max should be based on the
ratio of actual to estimated parallelism and on the number of processors
in the system. The second method specifies blocking
independently of the number of processors as a parallelism factor.
Given a parallelism factor, p_fac, the block size is chosen to be
max(2, (N-c_1)/(p p_fac)) or max(2, N_z/(p p_fac)), where p is the
number of processors. The k_max method
limits the maximum number of rows that are brought into local
memory. This restriction is useful, for instance, if the size of a
processor's local memory poses a limitation on block size. In the
p_fac method, the specification of the parameter p_fac is independent
of the number of processors. Thus, the optimum p_fac on a particular
multiprocessor system might be expected to give good performance
on a system with a different number of processors.


3.2. Data structures in the implementation of Block Solve 


The real-valued arrays a and b contain the nonzero array ele- 
ments of A and the vector b. The N x N array c contains the 
column index of the corresponding element in a. For instance, if 
c(i,j) contains k, then the element a(i,j) is the value of A(i,k).
Thus, only the nonzero elements of A are stored explicitly. The 
integer array e of length N contains the number of nonzero ele- 
ments in each row of the matrix. On each row access, the 
corresponding element in the array e is used to determine the 
number of elements that should be accessed from a and c. In 
addition, the array row_next of length N is used to maintain 
column lists. Since a row belongs to one column list at any 
instant, all the N column lists are maintained using row_next. 
Each column list has a head pointer that points to the first row in 
the list. The row_next pointer of the first row points to the next 
row in the list and so on. The row_next pointer of the last row in 
the list is set to NULL. 
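
The packed row storage and the threaded column lists can be sketched as follows. This is an illustration only: 0-based indices, the sentinel -1 standing in for NULL, and the class and function names are assumptions; col_head is the head-pointer array named later in this section.

```python
NULL = -1  # sentinel standing in for the NULL pointer


class ColumnLists:
    """All column lists threaded through a single row_next array."""

    def __init__(self, n):
        self.col_head = [NULL] * n     # first row of each column list
        self.row_next = [NULL] * n     # successor of each row within its list

    def push(self, col, row):
        """Place a row at the head of column list col."""
        self.row_next[row] = self.col_head[col]
        self.col_head[col] = row

    def pop(self, col):
        """Remove and return the first row of list col (NULL if empty)."""
        row = self.col_head[col]
        if row != NULL:
            self.col_head[col] = self.row_next[row]
            self.row_next[row] = NULL
        return row

    def rows(self, col):
        """Return the rows of list col in list order."""
        out, row = [], self.col_head[col]
        while row != NULL:
            out.append(row)
            row = self.row_next[row]
        return out


def lookup(a, c, e, i, k):
    """Return A(i,k) from packed storage, or 0.0 if it is not stored."""
    for pos in range(e[i]):            # only the first e[i] entries are valid
        if c[i][pos] == k:
            return a[i][pos]
    return 0.0


lists = ColumnLists(4)
lists.push(0, 0)
lists.push(0, 1)
print(lists.rows(0))
```

Because a row sits in exactly one list at a time, one successor slot per row is enough for all N lists.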


The data structures also maintain column-based information. 
A column is locked by a test-and-set instruction on the 
corresponding element in col_lock. The jth entry in the array 
col_elements contains n_j, the number of rows with column indices
equal to j, minus any rows currently reserved from column list j
by processors. The entries in the array col_missing contain the
number of rows reserved for exclusive access by processors. 


During a block reduction, when no further reductions can be per- 
formed with a row, it is transferred back to shared storage and the 
corresponding col_elements entry is incremented. At the end of a 
block reduction, a processor subtracts its contribution from the 
array col_missing. The jth element in the pointer array col_head 
points to the first row in the list of rows with column index j. 


The following data structures are used in Block Solve but
not in PSolve. A scalar integer head_col that points to the first
column with n_j>1 is maintained. The value of head_col is equal
to (c_1+1) and hence head_col is useful in obtaining an estimate of
the current parallelism in the C_1 method. Since columns j with
j <= c_1 have n_j=1, no further reductions are possible on rows
contained in columns {1, ..., c_1}. Therefore, processors use
head_col to skip over the first c_1 columns. Finally, when
head_col is N, the matrix is in upper triangular form and the
reduction phase is complete. Therefore, the processors check
head_col and terminate execution when head_col is equal to N.


The head_col pointer is maintained to improve performance;
an up-to-date value of head_col is not necessary for the correctness
of the algorithm. The head_col pointer is initialized to 1, and
thereafter, in order to simplify implementation and reduce synchronization,
only processor 1 advances head_col. To the other
processors, head_col is a read-only variable. Processor 1 increments
head_col if, during a block access, the column indexed by
head_col has col_elements equal to 1 and col_missing equal to 0.


Lemma 4. The procedure for advancing head_col guarantees that
all columns j with j < head_col always have n_j=1.


Proof: Reduction steps performed by the processors never
decrease the column index of a row. Therefore, if a row is read
with column index l and is written back with a column index m,
then m >= l. When head_col is incremented, the column previously
indexed by head_col has only one row associated with it.
Ensuring that col_missing is zero guarantees that rows accessed
from these columns have been returned before head_col is
advanced. Rows accessed from column lists j with j >= head_col
will be returned to column lists with indices no less than head_col.
Therefore, all column lists j with j < head_col will continue to
contain exactly one row. □
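
The advancing rule used in Lemma 4 can be sketched as below. This is an illustrative sketch: the function name is hypothetical, the arrays are 1-based with index 0 unused, and advancing in a loop over several columns in one call (rather than by a single increment per block access) is an assumption made for brevity. Locking is omitted.

```python
def advance_head_col(head_col, col_elements, col_missing, n_cols):
    """Advance head_col past columns that hold exactly one row and have
    no rows reserved by any processor. Stops at n_cols, where the
    reduction phase is complete.
    """
    while (head_col < n_cols
           and col_elements[head_col] == 1
           and col_missing[head_col] == 0):
        head_col += 1
    return head_col


col_elements = [None, 1, 1, 2, 1]      # index 0 unused
col_missing = [None, 0, 0, 0, 0]
print(advance_head_col(1, col_elements, col_missing, n_cols=4))
```

The col_missing test is what keeps the pointer from passing a column whose rows are still held in some processor's local block.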


3.3. Implementation of the algorithm 


A processor scans column lists beginning from head_col. It 
precedes accesses to column data structures by a lock operation on 
the corresponding element of col_lock and releases the lock on 
completing the data access. On encountering a column list with 
col_elements greater than 1, the processor dequeues all rows up to 
a maximum of the current maximum block size. If the number of 
rows acquired is less than the maximum block size, the processor
scans additional columns accessing one or more rows. A scan 
may be terminated by the early-quit criteria. 
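
A single-threaded sketch of one block access is given below. It is not the implementation measured here: locking on col_lock is omitted, column lists are modeled as a dictionary of deques rather than the row_next arrays, all names are hypothetical, and the early-quit comparison is taken as non-strict, which is an assumption.

```python
from collections import deque


def acquire_block(col_lists, head_col, n_cols, max_blk):
    """Scan column lists from head_col and reserve up to max_blk rows.

    col_lists maps a column index to a deque of row identifiers.
    Returns a list of (column, row) pairs in the order acquired.
    """
    block = []
    first_col = None
    for j in range(head_col, n_cols + 1):
        if len(block) >= max_blk:
            break
        rows = col_lists.get(j)
        if first_col is None:
            # A block starts only at a column holding more than one row.
            if rows is None or len(rows) < 2:
                continue
            first_col = j
        elif j - first_col >= len(block):
            break                       # early quit (non-strict form, assumed)
        while rows and len(block) < max_blk:
            block.append((j, rows.popleft()))
    return block


lists = {1: deque([10]), 2: deque([20, 21]), 3: deque([30]),
         4: deque(), 5: deque([50])}
print(acquire_block(lists, head_col=1, n_cols=5, max_blk=8))
```

In this example the scan skips column 1, takes both rows of column 2 and the row of column 3, and quits before column 5, leaving that row for other processors.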


Once a set of rows has been reserved exclusively for a pro- 
cessor, Givens’ reduction is used to reduce the set of rows until 
no two rows share the same column index. A local list of rows in 
the current block is maintained in sorted order using the respective 
column indices as the first key and the number of nonzero entries 
in a row as the second key. The second key is chosen to reduce 
fill-in. If the two rows at the head of the local list have different 
column indices, the row at the head is written back to shared 
storage since it cannot be used for performing further reductions. 
Otherwise, the rows are dequeued, one of the rows is reduced and 
the two rows are inserted into their correct positions in the local 
list. Ultimately, when there is just one row left, it too is written 
into shared storage. At this stage, since all rows have been 
returned, the col_missing array is updated. The above is repeated 
until head_col is advanced to N. 
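
The local block reduction can be sketched with a priority queue keyed on (leading column, number of nonzeros). This is an illustration under stated assumptions: rows are dictionaries mapping column index to value, the rotation is the standard Givens rotation, exact zeros produced by the rotation are dropped, and the function names are hypothetical.

```python
import heapq
import math


def rotate(r1, r2):
    """Givens-rotate two sparse rows sharing a leading column; the second
    result no longer contains that column."""
    j = min(r1)
    h = math.hypot(r1[j], r2[j])
    c, s = r1[j] / h, r2[j] / h
    cols = set(r1) | set(r2)
    n1 = {k: c * r1.get(k, 0.0) + s * r2.get(k, 0.0) for k in cols}
    n2 = {k: -s * r1.get(k, 0.0) + c * r2.get(k, 0.0) for k in cols if k != j}
    drop = lambda row: {k: v for k, v in row.items() if v != 0.0}
    return drop(n1), drop(n2)


def reduce_block(rows):
    """Reduce a block until no two rows share a leading column.
    Returns the rows in the order they would be written back."""
    heap = [(min(r), len(r), i, r) for i, r in enumerate(rows)]
    heapq.heapify(heap)
    serial = len(rows)                  # tie-breaker so dicts are never compared
    written = []
    while len(heap) > 1:
        col1, _, _, r1 = heapq.heappop(heap)
        if heap[0][0] != col1:
            written.append(r1)          # head row cannot reduce anything else
            continue
        _, _, _, r2 = heapq.heappop(heap)
        for r in rotate(r1, r2):
            if r:                       # skip a row that vanished entirely
                heapq.heappush(heap, (min(r), len(r), serial, r))
                serial += 1
    if heap:
        written.append(heap[0][3])
    return written


out = reduce_block([{0: 3.0, 1: 1.0}, {0: 4.0, 1: 2.0}])
print([min(r) for r in out])
```

Using the nonzero count as the second key means that, among rows with the same leading column, the sparsest pair is combined first, which is the fill-in heuristic described above.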


3.4. Measurements and analysis 


This section describes, presents and analyzes the measure- 
ments that were taken to explore blocking on the Alliant FX/8. 
The four blocking schemes that were implemented and the current 
maximum block size, max_blk, for each of the four schemes are:


(1) const_blk blocking uses constant blocking where the maximum
block size is fixed throughout program execution and
is one of the command-line parameters.


(2) hd_var blocking uses variable blocking where an estimate of
the parallelism from the head_col variable is scaled by a
command-line parameter, k_max, which is the maximum
blocking size with full parallelism.


$$ \mathrm{max\_blk} = \max\left(2,\ \frac{k_{max}\,\bigl((N+1) - \mathrm{head\_col}\bigr)}{N}\right) \qquad (3) $$


(3) pf_var blocking is similar to hd_var except that the parallelism
estimate is scaled by the command-line parameter, p_fac,
and the number of processors in the system, p.


$$ \mathrm{max\_blk} = \max\left(2,\ \frac{(N+1) - \mathrm{head\_col}}{p_{fac}\, p}\right) \qquad (4) $$


(4) nz_var blocking is similar to pf_var but uses the number of
columns with n_j=0, N_z, to estimate parallelism. The same
value of the command-line parameter, p_fac,nz, might be
expected to give similar performance for a range of matrices,
since the program dynamically adjusts to the available parallelism.


$$ \mathrm{max\_blk} = \max\left(2,\ \frac{N_z}{p_{fac,nz}\, p}\right) \qquad (5) $$
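
The four rules for max_blk can be collected in one function. This is a sketch under assumptions: the function signature is hypothetical, the constant block size of const_blk is passed through the k_max argument for convenience, and rounding the scaled estimate down to an integer is an assumption, since the rounding rule is not stated.

```python
import math


def max_blk(scheme, N, p=8, head_col=1, n_z=0, k_max=None, p_fac=None):
    """Current maximum block size for the four blocking schemes.

    N        -- matrix dimension
    p        -- number of processors
    head_col -- first column not yet known to hold exactly one row
    n_z      -- N_z, number of columns currently holding no row
    k_max    -- block size at full parallelism (or the fixed size for const_blk)
    p_fac    -- parallelism factor
    """
    if scheme == "const_blk":
        return k_max
    if scheme == "hd_var":
        estimate = k_max * ((N + 1) - head_col) / N
    elif scheme == "pf_var":
        estimate = ((N + 1) - head_col) / (p_fac * p)
    elif scheme == "nz_var":
        estimate = n_z / (p_fac * p)
    else:
        raise ValueError("unknown scheme: " + scheme)
    return max(2, math.floor(estimate))


print(max_blk("hd_var", N=1000, head_col=501, k_max=8))
print(max_blk("nz_var", N=1000, n_z=10, p_fac=1.0))
```

The lower bound of two reflects that at least two rows are needed for any row reduction.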


Experiments were conducted by running Block Solve on the 
Alliant FX/8 using various maximum block sizes and parallelism 
factors, for each of the matrices and blocking schemes described 
above. In each case, the execution time and the average block
size were measured and tabulated. The average block size is the
average number of rows that a processor acquires on each access
to the global queue. This number is never greater than the maximum
specified indirectly through the command-line parameters,
k_max or p_fac. Other measurements that were carried out for each
run of the program include the error norm, the density of nonzeros 
in the upper triangular matrix, the number of floating-point opera- 
tions, the number of row reductions, the number of accesses, the 
number of columns scanned on each access (skips), and the 
megaflops rating attained.


Fig. 3. Execution time (seconds) vs. average block size
(for various blocking schemes on a type2 size 1000 matrix)


Each of the four different blocking schemes that were implemented
restricts the number of rows that a processor can acquire
on each access. This restriction implicitly determines the average 
number of rows that a processor acquires on each access and 
thereby affects performance. For each blocking scheme, different 
ranges of the command-line parameters exhibit near-optimal per- 
formance. The execution times are therefore plotted versus the 
average block size, rather than versus the value of the command- 
line parameter. 


In Figure 3, the execution times are plotted versus the average
block size for each blocking scheme for a type2 matrix of size
1000. The points correspond to observed values of execution time
and average block size, for each value of k_max (or p_fac). A line
joins each point to the point corresponding to the next higher
value of k_max. A higher value of k_max does not necessarily result
in a higher average block size because there may be fewer rows
for the other processors if one processor accesses a larger block.
As a result, the lines occasionally move to the left. In general, the
execution time decreases slightly with increasing block size for
small block sizes, but then increases for larger block sizes. A
block size of four, in general, gives better performance than a
block size of two. Command-line parameter choices that result in
average block sizes of between three and four are optimal for the
variable blocking schemes, whereas a maximum block size
specification of eight (with resulting average block size of 7.4) is
optimum for the const blocking scheme. The more sophisticated
blocking schemes are more robust in the sense that they give
near-optimal performance over a wider range of the command-line
parameters, k_max and p_fac. In particular, the execution time with
nz_var blocking remains close to the minimum of 10.50 seconds
for any choice of the command-line parameter, p_fac, in the range
[0.3-6.0], whereas the execution time with const blocking remains
close to the minimum only for a choice of the command-line
parameter, maximum block size, in the range [4-10].


Fig. 4. Average block size achieved vs.
specified parallelism factor, p_fac
(for type2 matrices with sizes 200, 400, 800, 1000)


For the rest of this section, runs using nz_var blocking on
type2 matrices are examined further. Figure 4 shows the relationship
between the command-line parameter, p_fac, and the resulting
average block size. When p_fac>1, the number of rows available
for reduction exceeds the sum of the sizes of blocks in individual
processors. The sizes of the acquired blocks are indeed limited by
the maximum block size. Since the maximum block size varies
inversely as p_fac, the average block size also varies inversely as
p_fac. However, when p_fac<1, the block sizes are limited by the
parallelism in the matrix system. Thus, each processor attempts to
acquire blocks of size greater than N_z/p, but since there are only
N_z rows available, the processors, in general, acquire fewer rows
than specified by the maximum block size. Therefore, for p_fac<1,
the average block size is only weakly dependent on p_fac.
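
As a rough check of this argument (ignoring the lower bound of two and the early-quit criteria), suppose all p processors request the full nz_var maximum block size at the same time. The total demand is then

$$ p \cdot \frac{N_z}{p_{fac}\, p} = \frac{N_z}{p_{fac}} $$

which is below the N_z rows taken to be available when p_fac>1 and above it when p_fac<1. In the second case the requests cannot all be met in full, so the blocks actually acquired are set by the available rows rather than by the parameter.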


In this section, various blocking schemes were analyzed. 
Blocking can improve performance to some extent even in mul- 
tiprocessor systems that have a shared cache and effectively zero 
communication delay because of a reduction in synchronization 
overhead. An average block size between three and four leads to 
minimum execution times for variable blocking schemes. Execu- 
tion time curves are fairly flat over block size ranges near their 
optimum value. The performance of all four blocking schemes
with the corresponding optimal choices of command-line parameters
is comparable. However, a choice of maximum block size
outside the near-optimum range can greatly increase execution
time for the const blocking scheme. In contrast, the sophisticated
blocking schemes, pf_var and nz_var, are more robust because a
nonoptimal choice of p_fac does not significantly affect performance.

4. EMULATION OF GLOBAL DELAYS 


An eight-processor system where each processor has its own 
private memory and all the processors share a common global 
memory was emulated using the Alliant FX/8. The emulation was
accomplished by identifying shared variables in the program and 
inserting additional accesses for each shared variable access to 
model the shared memory access time. The interrelationship 
between blocking and global communication delay was examined 
by measuring performance for different types of systems using a 
range of parallelism factors and global delays. 


4.1. Global delays 


In order to emulate a shared global memory multiprocessor 
with private memories for each processor, occurrences of variables 
shared between processors are identified. In the linear system 
application, the following are shared and therefore in global 
memory: a, b, c, e, row_next, col_elements, col_lock, col_missing. 
Thus, the entire sparse matrix data structure as represented in Fig- 
ure 2 resides in global memory. A block of rows is accessed 
from global memory and stored in local memory. Subsequent 
accesses to a row contained in the block are satisfied by the local 
memory. 


To emulate the additional time that it would take to complete
accesses on shared variables from global memory, g_dly - 1 additional
accesses are made to locations in an otherwise unused array,
where g_dly is the factor by which the global memory is slower
than local memory. Additional vector and scalar accesses are
introduced for vector and scalar accesses respectively in the original
program. Provided that all accesses on the host multiprocessor
on which the emulation runs (the Alliant FX/8, in this case) take
identical time, this strategy should emulate the global delay accurately.
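
The padding idea can be sketched as a wrapper around a shared array. This is only an illustration of the technique for scalar accesses, assuming that each shared access is padded with g_dly - 1 reads of a dummy array; the class name, the dummy array size, and the sequential walk through the dummy array are choices of this sketch, and Python timing would of course not reproduce the behavior of the host machine.

```python
class DelayedArray:
    """Shared array whose every access is padded with g_dly - 1 dummy reads."""

    def __init__(self, data, g_dly, dummy_size=64):
        self.data = list(data)
        self.g_dly = g_dly
        self.dummy = [0.0] * dummy_size   # otherwise unused array
        self.pos = 0                      # next dummy location to touch
        self.extra = 0                    # count of padding accesses made

    def _pad(self):
        for _ in range(self.g_dly - 1):
            _ = self.dummy[self.pos]      # the additional access
            self.pos = (self.pos + 1) % len(self.dummy)
            self.extra += 1

    def read(self, i):
        self._pad()
        return self.data[i]

    def write(self, i, value):
        self._pad()
        self.data[i] = value


shared = DelayedArray([1.0, 2.0, 3.0], g_dly=5)
value = shared.read(1)
print(value, shared.extra)
```

With g_dly = 1 no padding is performed, which corresponds to a global memory as fast as local memory.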


However, the Alliant FX/8 has a hierarchical memory structure
with a virtual memory and a cache, which pose special problems
in performance evaluation [9]. Therefore, the additional
accesses used to emulate global delay were structured so that they
do not benefit from the cache prefetching effect any more than the
normal accesses (i.e., they have similar spatial locality). For
sufficiently large problems, cache hits are primarily due to pre- 
fetching an entire line (spatial locality). When the problem size is 
large, cached data tends to be replaced before it is reused. There- 
fore, the temporal locality is small and does not contribute 
significantly to cache hit ratio. With care taken to preserve the 
amount of spatial locality, the cache miss ratio is expected to 
approximate the miss ratio in the original program. Furthermore, 
the size of the dummy arrays used to emulate global accesses was 
chosen to be an order of magnitude smaller than the largest array 
in the program. This reduced the likelihood of thrashing on the
disk. Thus, the paging activity in the emulation is expected to
approximate that in the original program. Further study is neces- 
sary to verify that these steps are sufficient for accurate emulation. 


4.2. Measurements and analysis 


The program was run on synthetic and real sparse systems. 
For each matrix system, the program was executed for choices of 
p_fac in the range 0.1 through 5.0 and global delays of 1 through
10 times the private memory access delays. 


In Figure 5, the execution times for the matrices bp_1000
from the Harwell/Boeing collection, a synthetic sparse matrix of size
1000 of type2, and a synthetic sparse matrix of size 800 of type1
are plotted against the parallelism factor, p_fac. The multiprocessor
system that is emulated has a global memory that is five times
slower than the individual local memories. A parallelism factor
slightly less than one is found to give the minimum execution time
for the synthetic sparse matrix systems, whereas a parallelism
factor around 0.5 is ideal for the bp_1000 matrix. Because of the
early-quit criteria, the average block size is less than the maximum
block size, even if sufficient parallelism is present. A parallelism
factor of 0.5, therefore, does not imply that only half the processors
are busy.


The execution times for fs_541_2 are examined in Figure 6.
The global delay is chosen to be 3, 5, and 9 times the local delay
and execution times are plotted against the parallelism factor.
Parallelism factors at either end of the range result in large execution
times, either because of a lack of parallelism or because of
large communication delays. Furthermore, in the optimum range
of p_fac, the effect of global communication delays is masked and
execution times for global delays of 3, 5, and 9 are nearly identical.
This masking occurs because, although changes in global
delays do affect communication time, communication time is not a
significant part of execution time in the optimum range of p_fac. A
choice of parallelism factor around 0.5 minimizes execution times
for all three global delays. An average block size of approximately
13 is achieved with p_fac=0.5, which implies that on the
average only about one-fifth of the 541 rows in fs_541_2 are typically
reserved by processors for exclusive access.
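
The one-fifth figure follows from simple arithmetic, assuming that all eight processors of the emulated system hold an average-sized block at the same time:

$$ \frac{8 \times 13}{541} = \frac{104}{541} \approx 0.19 \approx \frac{1}{5} $$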


5. CONCLUSIONS 


This paper examined blocking in the context of linear system 
solvers. A large block size decreases communication and synchronization
at the expense of reduced parallelism. Four different
implementations of blocking were examined by solving various
linear systems on the Alliant FX/8. For the type2-1000 matrix 
running on eight processors, minimum execution times were 
obtained with an average block size of approximately four for 
variable blocking schemes and an average block size of 7.4 for the 
const blocking scheme. The more sophisticated blocking schemes 
that adjust the block size to match the currently available parallel- 
ism give good performance even if the user-specified parameters 
do not match the problem and the multiprocessor. 


The emulation of the execution of the Block Solve algorithm 
on a multiprocessor system with global memory that is slower 
than the private memories of the processors was described. 
Blocking is more useful in such a system because the communica- 
tion delays (as well as synchronization overhead) are reduced with 
larger block sizes. Parallelism factors are used to define task 
granularity for multitasking large problems; very small or very 
large factors result in large execution times either because of a 
lack of parallelism or because of large communication delays, 
respectively. A parallelism factor of approximately 0.5 minimizes 
execution time for a range of global delays. (Note that because of 
the early-quit criteria, a parallelism factor less than one does not 
necessarily imply that some processors are idle.) Furthermore, as 
global delay is increased, the execution time does not increase
rapidly with this choice of parallelism factor. With an average
block size of 13, communication time is not significant compared 
to computation time; hence, execution time is not affected 
significantly by changes in communication delay. 


This paper presents a new algorithm, Block Solve, for solving
sparse systems of linear equations that is a generalization of
the PSolve algorithm discussed in [6]. For most test matrices, the
PSolve algorithm was found to run faster on the Alliant FX/8 multiprocessor
than Gaussian elimination, which does not exploit sparsity,
and the Yale Sparse Matrix Package, which does not exploit
parallelism. In this paper, the performance of Block Solve with
moderate block sizes is found to be superior to a block size of
two, which is required by PSolve. The algorithm presented here is
likely to find application in shared-memory multiprocessors that
have private local memories that are significantly faster than the
shared global memory. The Block Solve algorithm also introduces
techniques to estimate parallelism during program execution and
continuously balance communication requirements and parallelism
based on the parallelism estimate. These techniques may be useful
in other asynchronous algorithms.


Fig. 5. Emulated execution time (seconds) vs. parallelism factor
(for various matrices on an eight-processor system
with global delay five times local delay)


Abstract -- This paper presents an analysis of an
Alliant FX/8 system running Xylem (Cedar’s operating 
system) at the University of Illinois Center for 
Supercomputing Research and Development. Results 
for two distinct, real, scientific workloads executing on 
an Alliant FX/8 are discussed. A combination of user 
concurrency and system overhead measurements were 
taken for both workloads. Statistical cluster analysis is 
used to extract a state transition model to jointly 
characterize user concurrency and system overhead. A
skewness factor is introduced and used to bring out
the effects of unbalanced clustering when determining 
states with significant transitions. 


1. INTRODUCTION


The evaluation of a parallel processor often 
consists of determining numerical performance indices, 
such as MFLOPS, for the machine using standard 
benchmarks. Although these indices are useful in 
detecting global weaknesses of the system, they are 
unable to provide detailed insight into system 
behavior. It is important to have methods which
provide information about the system’s performance 
under a certain workload, along with insight into how 
the workload and system interact. With such 
methods, the system can be more easily tuned for 
specific applications and vice versa. 


This paper presents an analysis of an Alliant FX/8 
system running the Cedar¹ operating system, Xylem, at
the University of Illinois Center for Supercomputing 
Research and Development (CSRD). Results for two 
distinct, real, scientific workload samples executing on 
an Alliant FX/8 are presented. In the analysis, a 
combination of user concurrency and system overhead 
measurements are employed. Statistical clustering is 
performed on these measurements to identify 
commonly recurring patterns of resource usage. State
transition models are extracted and interpreted for 
both sampled workloads to obtain practical insight into 
the system behavior. Skewness factors are then 
calculated for each interstate transition in the 
identified model and used to determine significant 
transitional relationships among the states of the 
machine. 


The results show that during the collection of the
first sample, the system was operating in states of high
user concurrency approximately 79% of the time. The
second sample, on the other hand, captures a system
operating in states of high user concurrency only 26%
of the time. In addition, the analysis shows that high
system overhead is usually accompanied by low user
concurrency. The analysis also indicates that for both
workloads, the state of the system was highly
predictable. This predictability was largely due to
slow changes in system states. In particular, states
with extremely high values of paging or user
concurrency are usually preceded by states with less
paging and user concurrency, much like stair climbing.
A stepping down effect is observed when the machine
leaves these extreme states.


¹ The Cedar project is a parallel supercomputing experiment
which consists of interconnecting Alliant FX/8's to a large
shared global memory [1] and [2]. Each Alliant is known as
a cluster of the Cedar machine.


This research was supported by the National Aeronautics
and Space Administration under NASA grant NAG-1-613.


1.1 Related Research 


There have been several studies which analyze the
performance of multiprocessor systems. Most of these
employ simulation or analytical-based techniques [3],
[4], [5]. Few have investigated the effect of a real 
workload on system performance. In McGuire and 
Iyer [6] concurrency of real workloads executing on an 
Alliant is monitored and analyzed. The rest of the 
performance related work on the Alliant FX/8 has 
dealt mainly with the use of tools for evaluation or 
determination of performance indices [7], [8], and [9]. 


The current study not only analyzes real 
performance and resource usage data but also extracts 
transition models to represent the measured workload 
environment. The models are interpreted to gain 
insight into the interaction of the workload and system 
and to determine the amount of concurrency in the 
workloads. 


A major step in obtaining the workload models is 
statistical clustering. In recent years, this approach has
found many uses in the field of computer evaluation 
[10]. Devarakonda and Iyer [11] use clustering as a 
step in creating transition models which are then used 
to predict resource usage. Hsueh et al. [12] use similar 
techniques to create performability models for a 
multiprocessor system. Ferrari [13], on the other hand, 
uses clustering in the creation of artificial workloads. 


The next section contains a discussion of the 
measured environment. Section 3 introduces the 
measurements used in this study. A number of 
preliminary results for the two samples are presented 
in Section 4. Section 5 describes the modeling 
techniques and presents the cluster and transition 
models obtained for the two samples. Section 6
summarizes the major results and suggests possibilities
for future work. 


Measurement   Description

CONCUSER      % of time CEs clustered and running user code
clsyst        % of cluster time spent running system code
cluset        % of cluster time spent running user code
CLUSTIM       % of time spent in the cluster configuration
ipsyst        % of time IPs spent running system code
CEUT          Utilization of entire CE complex
IPUT          Utilization of entire IP complex


Table 1
Measurement Descriptions


2. THE MEASUREMENT ENVIRONMENT 


The measurements for this study were taken from 
real, scientific workloads being executed by an Alliant 
FX/8 on weekday afternoons. The FX/8 is a 
multiprocessor mini-supercomputer with a 32 
Megabyte shared global memory [14]. It can best be
understood as two complexes or "clusters" of processors.
The main complex, the Computational Element (CE) 
cluster, consists of eight processors. These either work 
concurrently in the "clustered" configuration or 
separately in the detached configuration. When the 
CEs are detached, they can be used as eight separate 
processors working on different jobs, or groups of them 
can be used to multiprocess the same job. When in the 
clustered configuration, the concurrency control bus 
synchronizes the eight CEs to concurrently process a 
single job. 


The second complex of processors on the measured 
Alliant consists of three Motorola MC68012 
microprocessors called the Interactive Processors (IPs). 
In the measured system, the IPs handle all accesses to 
secondary memory and interactive user work such as 
editing jobs. It is important to note that the operating 
system on the measured machine is Xylem, which was 
specifically designed for the Cedar supercomputer, and 
not Concentrix, Alliant’s operating system. For this
reason, this paper is more an analysis of a single-cluster
Cedar supercomputer than an analysis of the
Alliant FX/8.


The measured FX/8 is used for application and
algorithm development at the Center for Supercomputing
Research and Development (CSRD). This diverse
environment is representative of many scientific, 
parallel program developmental situations. The 
measured programs include those specifically designed 
to optimize the concurrency allowed by the Alliant’s 
architecture along with jobs that were suboptimal. 


3. MEASUREMENTS 


Two software facilities developed at UICSRD were
used to measure system behavior. The facilities
monitored the system concurrently so both types of
measurements were collected approximately
simultaneously.


The word cluster is admittedly overused in this
paper. The Alliant FX/8s are clusters of the Cedar, while
the FX/8s have their own clusters. Later, cluster models
will be introduced. This overlap was unavoidable if the
terminology is to remain consistent with the other
literature on these subjects.
The first facility was used to measure the amount 
of concurrency in the workload. It used a high 
resolution (10 microsecond) timer to measure the 
amount of time each processor was executing system 
and user code, as well as the amount of time each 
processor was idle. These measurements were taken 
separately for the two CE configurations (i.e., detached 
and clustered). The percentage of time the CEs were 
clustered and executing user code (CONCUSER) was 
then determined. The CONCUSER parameter thus 
measures user concurrency in the workload and should 
be high for observations with well-tuned applications 
running. 


The second software facility measured the 
overhead associated with virtual memory and system 
operations such as paging, swapping, system calls, 
context switches, and file searches. Of approximately 
150 meters available, those presented in this paper are 
context switches, page-ins, and page-outs. Page-ins are 
defined as the number of disk accesses to bring pages 
into main memory. Correspondingly, page-outs are the 
number of separate disk accesses for writing back to 
disk. It should be noted that the O/S facility does not 
provide separate measurements for each processor, but 
running totals for all the processors combined. 


All measurements discussed above were sampled 
approximately simultaneously every 45 seconds. In 
addition to these measurements, the parameters 
summarized in Table 1 were calculated. Notice that 
some of the percentages in the table are calculated over 
the entire 45-second period, and others are calculated 
just over the time spent in a specific configuration. 
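The difference between whole-period and within-configuration percentages can be illustrated with a small worked example. The sketch below assumes, following the descriptions in Table 1, that CLUSTIM is taken over the whole observation while cluset is taken only over the clustered time, so that CONCUSER is their product. The function name and the timing values are hypothetical and are not measurements from this study.

```python
def concuser_from_times(period_s, clustered_s, clustered_user_s):
    """Return (CLUSTIM, cluset, CONCUSER) as percentages for one observation.

    period_s:         length of the observation (45 s in this study)
    clustered_s:      time the CEs spent in the clustered configuration
    clustered_user_s: part of clustered_s spent running user code
    All three inputs are hypothetical illustration values.
    """
    clustim = 100.0 * clustered_s / period_s          # over the whole period
    cluset = 100.0 * clustered_user_s / clustered_s   # over clustered time only
    concuser = clustim * cluset / 100.0               # back over the whole period
    return clustim, cluset, concuser


clustim, cluset, concuser = concuser_from_times(45.0, 30.0, 24.0)
print(round(clustim, 2), round(cluset, 2), round(concuser, 2))
```

With these inputs, cluset is 80% of the clustered time, but CONCUSER is only 24 of 45 seconds, because the detached time also counts in its denominator.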


Each 45-second period is one observation of the 
system, and the measurements collected during that 
period depict the state of the system for that 
observation. The length of the observation was 
experimentally determined and chosen so that it 
would best correspond to the length of an actual, 
physical state of the machine. 


Several workload samples were collected for this 
study. In this paper, two markedly different workload 
samples are presented. The first sample was taken over 
a 138-minute period. The second sample, on the other 
hand, is 168 minutes long. To provide a broad 
understanding of the two workloads and their
interactions with the system, some preliminary
statistical analysis is presented in the next section.


                        Sample One                  Sample Two
Measurement          mean        std. dev.       mean        std. dev.

context switches     1782.508    [illegible]     1503.382      665.230
device interrupts   20389.339    [illegible]    18459.958    11929.022
page-ins                0.109    [illegible]       24.747       66.278
page-outs               1.869    [illegible]       18.116       40.957
CE utilization          0.723    [illegible]        0.393        0.246
IP utilization          0.304    [illegible]        0.271        0.101
CLUSTIM                71.632    [illegible]       63.920       14.971
cluset                 90.165    [illegible]       39.879       34.000
clsyst                  9.835    [illegible]       21.727       15.159
ipsyst                 23.716    [illegible]       17.231        4.146
CONCUSER               64.625    [illegible]       27.028       26.028


Table 2
Measurement Means and Standard Deviations


4. PRELIMINARY ANALYSIS


4.1 Means and Standard Deviations 


Table 2 summarizes the means and standard 
deviations of each parameter studied. Sample One is 
characterized by high user concurrency and CE 
utilization. The relatively small standard deviations 
for the parameters indicate stable activity during the 
collection of the sample. Sample Two, on the other 
hand, is characterized by low user concurrency and CE 
utilization. The standard deviations for this sample are 
high (e.g. see CONCUSER) indicating that the sample 
captured a workload consisting of bursts of work 
surrounded by idleness. 


The table also shows an imbalance between the IP 
and CE utilizations for both samples. This imbalance, 
especially for Sample One, may be partially attributed 
to the low paging activity. (All accesses to disk must
be made through IP0; thus, when the paging is low, the
IP utilization tends to be low.) Another cause of this
imbalance is Xylem’s scheduling policy. Whenever
possible, jobs are scheduled on the CEs because they are
much faster than the IPs.


Table 2 also highlights the paging differences 
between the two sampled workloads. Sample One 
contains very little paging, while Sample Two has a 
substantial amount of paging. The standard deviations 
for the paging activities of both samples are quite high, 
suggesting intervals of high paging activity interspersed 
with periods of little or no paging. The periods of little 
paging activity are easily explained by the large 32-MB 
physical memory found in the Alliant. 


4.2 Individual System/User/Idle Times 


In this section the behavior of the individual
processors is studied. Figures 1 and 2 show the
percentage of time the processors spend executing
system and user code, along with the percentage of
time the processors are idle. The bars shown for the
individual CEs, CE0-CE7, pertain to the time spent in
detached configuration. The cluster bar (CL) shows the
breakdown for the CEs’ utilizations while in the
clustered configuration (only one bar is needed because
all CEs work on the same job in this configuration).
It is important to realize that these percentages are not
calculated over the whole period, but only the period
in which the CEs are in the specified configuration. For
example, Figure 1 shows that while detached, CE7 is
idle 45% of the time, executing system code 30% of the
time, and executing user code 25% of the time.


Figures 1 and 2 confirm the low utilization of the 
IPs. They also show that the work done on the IPs is 
evenly balanced. Note also the low utilization of the
lower-numbered CEs while in the detached mode.
The majority of the work in the detached mode is done 
by CE6 and CE7. These results suggest that a better 
design may be to allow the four lower CEs to form 
their own cluster. Thus, when the detached mode is 
needed, the upper four processors can break free and 
handle the work. Meanwhile, the lower four stay in 
clustered configuration and continue to service the jobs 
waiting on the cluster queue. 


In summary, the preliminary analysis shows that 
Sample One captured a system with high, steady CE 
utilization, little paging, and a high degree of user 
concurrency. This is the result of a relatively stable 
workload. Sample Two, on the other hand, is made up 
of observations with high variability in their CE 
utilization and in the amount of paging they capture. On
average, the sample also shows very little user
concurrency. This is the result of a generally light
workload with bursts of high activity. In addition, for 
both samples, the lower numbered CEs in the detached 
configuration and all three IPs showed low utilization. 


5. MODEL EXTRACTION


In this section we extract state transition models 
to quantify the variation in system activity for each 
workload. Four parameters were selected to jointly 
characterize user concurrency and system overhead. 
These were IPUT, context switches, CONCUSER, and 
pagact (pagact = page-ins + page-outs, the total number 
of accesses to disk). Each observation is treated as a 
point in four-dimensional space. Statistical clustering 
analysis is used to identify similar classes (clusters) in 
this space. Each cluster is then defined as a system 
state, and a state transition model (consisting of 
intercluster transition probabilities) is developed. 
Figure 1
User/System/Idle times


These transition probabilities may be used to predict 
forthcoming states of the machine. They also provide a 
solid understanding of the relationships between states. 


Next the extracted cluster model is interpreted by
computing "skewness factors" for each transition. The
skewness factors quantify the degree to which
transitional relations between states were caused by
random transitions. More specifically, a skewness
factor determines the skewness of a transition
probability with respect to the transition probability
that would be obtained if each inter-observation
transition was equally likely. The skewness factor
($S_{ij}$) of a transition from state i to state j is defined as

$$S_{ij} = \frac{\text{observed number of transitions from state } i \text{ to state } j}{\text{probable}^{*} \text{ number of transitions from state } i \text{ to state } j}$$

*Assuming that the transition to any observation is
equally likely regardless of the cluster it is in.


The skewness factors bring out the effect of 
unbalanced clusters and quantify significant transitions 
between clusters. A significant transition is one that
may have an underlying system-related cause, and is
not just the result of random action. In some cases,
small transition probabilities can mask these significant 
transitions. In other cases, the skewness factor may 
show that transitions which appear to be significant 
(because of high transition probabilities) may actually 
be explained by random transitions among states. A 
skewness factor near unity indicates that there is 
probably not a significant transition between states. 
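The skewness factor can be sketched in a few lines of code. The version below is one reading of the definition, not the authors' implementation: it assumes that the "probable" number of transitions from state i to state j is the number of transitions leaving i multiplied by the share of all observations that fall in cluster j. The function name and the example sequence are hypothetical.

```python
from collections import Counter


def skewness_factors(states):
    """Skewness factor for every observed transition.

    states: cluster label of each observation, in time order.
    Assumption: a transition lands on any observation with equal
    probability, so the expected count for i -> j is
    (transitions leaving i) * (observations in j) / (all observations).
    """
    observed = Counter(zip(states, states[1:]))   # i -> j counts
    leaving = Counter(states[:-1])                # transitions out of i
    members = Counter(states)                     # observations per cluster
    total = len(states)
    factors = {}
    for (i, j), count in observed.items():
        expected = leaving[i] * members[j] / total
        factors[(i, j)] = count / expected
    return factors


# Hypothetical sequence of cluster labels for eight observations.
factors = skewness_factors([1, 1, 1, 2, 1, 1, 2, 2])
for pair in sorted(factors):
    print(pair, round(factors[pair], 3))
```

Under this reading, a factor above one marks a transition that occurs more often than the cluster sizes alone would explain, and a factor below one marks a transition that occurs less often.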
Figure 2
User/System/Idle times


Following this, skewness factors are calculated and 
used to detect significant transitional relationships 
between the states of the system. 


5.1 Clustering, Transition Models, Skewness Factors 


The cluster models were obtained using the 
FASTCLUS procedure from the SAS software package 
[15]. This procedure uses a K-means clustering 
method, grouping observations into clusters that 
minimize the intracluster distances between points, 
while maximizing the intercluster distances. All 
distances are Euclidean. 
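As a rough illustration of the clustering step, the following sketch implements a plain Lloyd-style K-means with Euclidean distances. It is not a reimplementation of the SAS FASTCLUS procedure, whose seeding and update rules differ in detail; the observation values, the choice of seeds, and the function name are hypothetical. Each tuple stands for one observation in the order (IPUT, context switches, CONCUSER, pagact).

```python
import math


def kmeans(points, seeds, iterations=20):
    """Group points around centroids using Euclidean distance.

    points: list of equal-length tuples, one per observation.
    seeds:  initial centroids supplied by the caller (an assumption;
            the study's seeding method is not reproduced here).
    Returns (labels, centroids).
    """
    centroids = [tuple(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each observation joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical observations: two with high concurrency, two with heavy paging.
observations = [
    (0.20, 1000.0, 80.0, 0.0),
    (0.25, 1100.0, 75.0, 1.0),
    (0.60, 3000.0, 10.0, 50.0),
    (0.55, 2900.0, 15.0, 40.0),
]
labels, centroids = kmeans(observations, [observations[0], observations[2]])
print(labels)
```

Because the four parameters have very different ranges, an unscaled Euclidean distance is dominated by the context-switch count in this toy data; whether and how the measurements were scaled before clustering is not stated here, so the sketch leaves them as given.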


The cluster models obtained are studied from 
three different perspectives, each providing different 
types of results. At the most basic level, the 
clusterings of observations are studied verbatim to 
determine the characteristics of the different states in 
which the machine is found. By the number of 
observations in each cluster, the percentage of time the 
machine is in each of these states may be determined. 
From this, the efficiency of the machine may be 
ascertained. 


The second form of analysis requires the creation 
of a state transition model, which consists of the 
probabilities for each intercluster (interstate) 
transition. These probabilities are easily estimated
from the collected data with the following formula
($P_{ij}$ = probability of transition from state i to state j):

$$P_{ij} = \frac{\text{observed number of transitions from state } i \text{ to state } j}{\text{observed number of transitions from state } i}$$



The most desirable state, i.e., high user
concurrency, is captured by the observations found in 
clusters five and six. Cluster six contains observations 
with higher user concurrency, lower IP utilization, and 
fewer context switches than the observations in cluster 
five. It is interesting to note that the high user 
concurrency captured by observations in these clusters 
is accompanied by relatively low IP utilization and few 
context switches. 


The system is in a state of high user concurrency 
approximately 52% of the time (clusters five and six), 
with lesser but still impressive amounts of concurrency
being seen about 27% of the time (cluster four). The
undesirable states (one and two) account for only 18%
of the sample.


The transition model extracted for Sample One is 
shown in Figure 3. The high self-loop transition 
probabilities suggest that for all states (except state 
three, where the self-loop probability is only 0.2), 
there is a good chance the machine will operate in the 
same state during the following observation. The 
skewness factors confirm this relationship, and show 
that state three also has an affinity to return to itself. 
Table 3
Centroids of Clusters: Sample One
In summary, the skewness factor may be viewed
as a validity test which provides a measure of
credibility for the state transition model. In other
words, it indicates whether there is any 'real'
information in the transition model, or whether it just
captured random activity.


5.2 Cluster Analysis for Sample One 


The cluster model extracted from Sample One is 
summarized in Table 3. Cluster one depicts a system
with a high context switch rate, relatively high IP 
utilization, and low user concurrency. Cluster two has 
similar characteristics, except they are not quite as 
extreme. The observations in these clusters most likely 
reflect a high degree of multiprogramming which may 
have reduced the concurrency exploitation. 


The third cluster, which only accounts for 2.73% 
of the sample, contains observations with considerable 
paging activity. As expected, the paging activity is 
accompanied by above-average values for both IP 
utilization and context switching. These observations 
also show a lower than average user concurrency. 


Figure 3
Transition Model: Sample One
(arcs are labeled transition probability/skewness factor; nodes are labeled with the cluster number)


This is a good example of a skewness factor identifying 
a significant transition that the transition probability 
by itself would have masked. 


An interesting phenomenon brought out by the 
transition model is the lack of interaction between the 
high and low concurrency states. The only observed 
transitions into the high user concurrency state (six) 
were from states four, five, or six, which are other 
states depicting substantial user concurrency. This 
phenomenon is also seen for transitions into state five, 
the state depicting the second highest degree of 
concurrency in this model. Conversely, there are few 
observed transitions from the two high concurrency 
states (five and six) to the low concurrency states (one 
and two). Thus, it can be concluded that the machine 
does not experience sudden jumps from high user 
concurrency to low user concurrency, or vice versa. 
Transitions from these extremes are made by stepping 
through intermediate states, such as state four. 


The near unity skewness factors for all six 
transitions from state four indicate that the transitions 
from this state were almost uniformly distributed 
among the observations, regardless of the clusters 
obtained. Obviously, the behavior of the machine after 
being in this state would be the most difficult to 
predict. As hinted at above, state four acts as the 
dispenser, or lowest step, to the extreme states of the 
system. 


A final point of interest is the relationship 
between state one and state three as indicated by the 
skewness factors (but masked by the transition 
probabilities). The transition probabilities between 
these states are not very high, but the skewness factors 
are both 3.3. Recall that both states depict a system of 
low user concurrency, with state three also 
corresponding to high paging activity, and state one 
corresponding to high IP utilization. Thus the 
skewness factor is able to bring out an underlying 
system related dependency between paging and IP 
utilization. 


5.3 Cluster Analysis for Sample Two 


A summary of the cluster model extracted for
Sample Two is presented in Table 4. The dominant
cluster in the model is cluster two, which accounts for
almost half of the observations. Although the cluster
depicts a near-idle system, it should not be regarded as
a weakness of the machine, but a consequence of
monitoring real workloads. (Long periods of time
passed with an extremely light workload while this
sample was taken.) In the analysis, cluster two is
ignored, when possible, because it reveals little about
the system’s behavior under a substantial workload. A
more revealing cluster is cluster one. The observations
in this cluster show very little user concurrency,
relatively high IP utilization, and a large number of
context switches.


Table 4
Centroids of Clusters: Sample Two


The desirable state, high user concurrency, is 
captured by the observations in clusters three and five. 
Cluster three contains observations with high user 
concurrency, low IP utilization, little paging activity,
and few context switches. Cluster five contains 
observations with similar, but less extreme 
characteristics. 


The paging activity that was first observed in the 
preliminary analysis is captured by the observations in 
clusters four and six. Of the two, cluster six contains 
the observations with the higher paging activity. The 
high paging is accompanied by high IP utilization, and a 
large number of context switches. It should also be 
pointed out that both paging clusters contain 
observations having low user concurrency, with cluster 
six (extreme paging observations) showing less
concurrency than cluster four (medium paging 
observations). Note that for both samples, paging is 
seen to adversely affect the amount of user 
concurrency exploited. 


If we work under the assumption that cluster two 
contains only observations of the system under a light 
workload, we can discard these values for a quick 
analysis of the efficiency of the system under 
substantial workload. With the cluster two 
observations discarded, the percentage of observations 
for the other clusters is doubled. This puts the system 
in the desirable clusters (three and five) about 52% of 
the time, which is similar to Sample One. In addition, 
we find the system is in the paging clusters about 32% 
of the time, and in the undesirable cluster (one) about 
14% of the time. In summary, the analysis shows that 
while the machine was under a substantial workload, 
which was only about half the time, user concurrency 
was high at times, but not consistently high. 
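The rescaling described above is simple arithmetic: once a cluster holding about half of the observations is discarded, every remaining share is divided by the fraction that is left, which doubles it. The sketch below illustrates this; the input shares are hypothetical values obtained by halving the rounded figures quoted in the text, not entries read from Table 4, and the function name is likewise an assumption.

```python
def shares_without(shares, dropped):
    """Rescale percentage shares after discarding one cluster.

    shares:  mapping from group name to percent of all observations.
    dropped: name of the group to discard.
    """
    remaining = 100.0 - shares[dropped]
    return {
        name: 100.0 * pct / remaining
        for name, pct in shares.items()
        if name != dropped
    }


# Hypothetical shares (percent of all Sample Two observations).
shares = {
    "near idle (two)": 50.0,
    "desirable (three, five)": 26.0,
    "paging (four, six)": 16.0,
    "undesirable (one)": 7.0,
}
rescaled = shares_without(shares, "near idle (two)")
for name, pct in rescaled.items():
    print(name, round(pct, 1))
```

The rescaled shares do not sum to exactly 100 because the inputs are themselves rounded.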


The transition model for Sample Two is presented 
in Figure 4. As in Sample One, the transition 
probabilities and skewness factors are largest for self- 
loops. This indicates that the state of the system is 
fairly stable. 
The skewness factors for the two paging clusters
(four and six) are especially interesting because there
are few nonzero values. The only way to get to state
six (high paging) is through state four (medium
paging), and the only way to leave it is again through
state four. This stepping-stone effect goes even further.
The only way to get to state four (besides from itself or six)
is through state five, the third highest paging state.
Therefore, the system gradually builds up to high 
levels of paging and then gradually dissipates back 
down to nothing. In addition, as in Sample One, this 
stepping-stone effect is also seen for the user 
concurrency measurement. 


6. CONCLUSIONS


In this paper we have presented an analysis of an 
Alliant FX/8 system running Xylem (Cedar’s operating 
system) at the University of Illinois Center for 
Supercomputing Research and Development. 
Preliminary analysis showed that the first workload
sample was characterized by consistently high user
concurrency, low system overhead, and little paging.
The second sample captured much less user
concurrency, but had significant paging and system
overhead. In addition, it was seen that both the IPs
and the four lower-numbered CEs, while detached,
were underutilized.


The results from the statistical models showed
that during the collection of the first sample, the
system was operating in states of high user
concurrency approximately 79% of the time. The
second workload sample captured the system in high
user concurrency states only 26% of the time. In
addition, it was discovered that high system overhead
was usually accompanied by low user concurrency.
The analysis also showed a high predictability of
system behavior, for both workloads. This
predictability was largely due to slow changes in
system states. In particular, states with extremely
high values of paging or user concurrency are usually
preceded by states with less paging and user
concurrency, much like stair climbing. A stepping
down effect was observed when the machine left
these extreme states.


Figure 4
Transition Model: Sample Two


Future research will include cluster analysis of 
individual programs and benchmarks to determine 
their behavior on the system, and to further evaluate 
the techniques developed. Similar studies on other 
multiprocessor environments are also in the planning 
stages. 
Abstract


In this paper, we investigate the interaction of concur- 
rent database algorithms with the underlying multiproces- 
sor computer architectures. We implement an optimistic 
concurrent B-link tree access algorithm on two simulated 
multiple processor computer architectures: a shared sec- 
ondary storage system, and a processor-per-secondary stor- 
age system. It has been observed that the average de- 
gree of concurrency and the transaction throughput of the 
processor-per-secondary storage system are much greater 
than those of the shared secondary storage system. 


1. Introduction


The performance of a distributed system is influenced 
both by the underlying architecture and the algorithms 
that control the execution of the system. Among the sev- 
eral architectural factors that influence the performance of 
a multiprocessor based distributed system, the interconnec- 
tion of processors and secondary memories is an important 
consideration. Similarly, the concurrency control algorithm 
employed by a distributed system greatly influences the
availability of the system to its users. Currently, we are in- 
vestigating the interaction between concurrency control al- 
gorithms and the underlying computer system components 
in a distributed system. 

We study two configurations of processors and memo- 
ries, each communicating on a shared interconnection net- 
work. Both systems contain processors that execute search 
and insert operations on a shared file indexed by a B-tree 
based structure. Due to its efficient sequential and random 
access mechanism, we chose the B-link tree of Lehman
and Yao [5]. 

Previous experiments in this area investigated a centrally- 
accessed data object (a B-link tree) shared between a set of 
processors [3]. The experiments presented here investigate 
the performance characteristics of a distributed B-link tree. 
The addition of these results allows us to compare two dif- 
ferent system architectures executing the same concurrency 
control methods. 

This paper is organized as follows. Section 2 describes 
concurrent access algorithms for B-tree systems. Section 3 
briefly describes an evaluation scheme for the B-tree con- 
currency algorithms. Section 4 describes the proposed im- 
plementation of the B-link tree system. Section 5 briefly 
describes the simulation parameters. Section 6 describes 
the evaluation metrics adopted in this paper. Section 7 
presents the results obtained from the current experiments
and compares them with the earlier results. Finally, Section 8
offers some concluding remarks.


2. A Concurrent B-tree System 


Our focus in this paper is on evaluation of the opti- 
mistic B-tree system proposed by Lehman and Yao [5]. 
This system uses a modified form of B-tree, called a B- 
link tree, which provides multiple access paths to terminal 
data nodes. Users are provided with a high degree of con- 
current access to the shared tree through a limited set of 
operations, which include Search, Insert, and a simplified 
Delete (this Delete is a logical Delete — it removes key val- 
ues but does not cause tree restructuring). More complete 
discussions of B-tree algorithms can be found in [3,5]. 
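The role of the extra links can be seen in a simplified search routine. The sketch below is an illustration in the spirit of the B-link tree, not the algorithm as implemented in this study: it omits locking and insertion entirely, and the `Node` layout, the field names, and the example tree are hypothetical. It shows only how a reader that arrives at a node whose key range has moved during a split can recover by following right links instead of restarting at the root.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    high_key: Optional[int] = None   # largest key this node may hold; None = unbounded
    right: Optional["Node"] = None   # link to the right sibling on the same level


def search(root: Node, key: int) -> bool:
    """Return True if key is stored in a leaf reachable from root."""
    node = root
    while True:
        # Recovery: the key lies beyond this node's range, so a split
        # has moved it to a sibling; follow right links to find it.
        while node.high_key is not None and key > node.high_key and node.right is not None:
            node = node.right
        if not node.children:
            return key in node.keys
        # Descend to the child whose range covers the key
        # (child i holds keys <= keys[i]; the last child holds the rest).
        index = sum(1 for separator in node.keys if key > separator)
        node = node.children[index]


# Hypothetical tree caught in mid-split: leaf `a` has handed keys 8 and 10
# to a new sibling `a2`, but the parent does not yet point to `a2`.
b = Node(keys=[12, 20])
a2 = Node(keys=[8, 10], high_key=10, right=b)
a = Node(keys=[1, 5], high_key=5, right=a2)
root = Node(keys=[10], children=[a, b])

print(search(root, 8), search(root, 7), search(root, 12))
```

In this example the search for key 8 reaches leaf `a` through the stale parent pointer and then traverses one right link to reach `a2`.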

The key performance parameters in the algorithm are:
frequency of interference, cost of validation (a static
quantity), and cost of recovery (a dynamic quantity). In a
previous study, interference and recovery of multiple
processors accessing a shared B-link tree were measured [3].
The average number of interferences per processor was
found to approach a constant as the number of processors
increased. Also, it was found that an average of between
one and two links were traversed during each recovery [3].


3. Concurrent System Evaluation 


Performance evaluation of concurrent tree algorithms
has typically involved modeling and analysis, with
high-level simulation used to provide additional support for
analytical results. Analytical techniques are primarily used
to predict (bound) certain aspects of concurrent algorithm
performance. Markov chain models have been used to
derive throughput and response time of static locking on a
centralized database system [8]. Probabilistic analysis has
proved useful in estimating the expected number of waiting
updaters (processes), waiting readers, and the number
of locks held by the processes [1]. Kung and Robinson
employ probabilistic analysis to compute the number of
conflicts between transactions that occur for locking protocols
[4]. Because of the complexity of concurrent transaction
systems and the simplifying assumptions necessary for
using analytical models, we concentrate on simulation
experiments to measure system performance.

Evaluation of a binary search tree shared among
transaction and maintenance processors using a two-phase
locking protocol has been performed in [6]. In this study, each
processor has its own local memory and shares a common
global memory. The average ratio of processor waiting time
to total processor execution time is determined as a
function of the number of operations per transaction [6].
Locking protocols for AVL-trees, 2-3 trees, and linear hashing
[2] have been proposed and evaluated through analysis and
high-level simulation; these evaluations have the average
number of concurrently busy transactions as the metric of
interest.


4. User Transaction Managers and Shared Resource Managers


A range of possible multiprocessor architectures may 
support a concurrent B-link system. We have narrowed 
this investigation to systems composed of user transaction 
managers (UTMs), shared resource managers (SRMs), and 
an interconnection network. 

A user transaction manager executes a B-link tree ac- 
cess algorithm on behalf of user transactions. There is 
one user transaction manager per processor to manage the 
transaction requests received at that processor (or network 
node). Each user transaction manager executes the same 
access method and uses the system interconnection network 
to access the nodes of the B-link tree(s). Each UTM coordi- 
nates the processing of a transaction by sending the access 
requests to shared resource managers, and processing the 
received B-link tree node data. 

A shared resource manager maintains and controls the
access to the B-link trees (or their partitions). The B-link
tree is itself stored on a secondary storage system. The
shared resource manager communicates with the secondary
storage device to execute the requests received from the
different user transaction managers. An SRM grants locks on
disk pages, unlocks pages, transfers pages to and from the
secondary store, and communicates with the user
transaction managers through the interconnection network. All
low-level, hardware-related secondary storage functions such
as disk allocation and physical data format are hidden from
the user transaction managers.

In this paper, we discuss two alternative computer
architectures to interconnect UTMs and SRMs. These are
shown in Figures 1 and 2.

System 1 consists of multiple user transaction managers
each communicating with a single shared resource manager.
Communication between the managers is achieved over a
shared network. In Figure 1, $P_{utm}$ represents a
processor executing one of the user transaction managers.
Similarly, $P_{srm}$ represents a processor executing the one and
only shared resource manager. S represents the shared
bus system. Finally, M represents the secondary storage
system that stores the B-link tree and the corresponding
data associated with the keys in this tree [3].

In System 2, each processor in the multiprocessor system
contains a user transaction manager and a shared resource
manager. Each processor is associated with a dedicated
secondary storage device (M). The two separate functions
of System 1, user transaction manager and shared resource
manager, are multiprogrammed on the same processor in
System 2. Thus $P_{utm}$ and $P_{srm}$ are logical processors
corresponding to one physical processor. The processors are
interconnected on a shared communication network (S).


5. Simulation Parameters 


The performance of the application system is governed
by the operation workload and the execution characteristics
of each of the following components: the processors,
the user transaction managers, secondary storage access,
shared bus, and the shared resource managers. Figure 3
describes one cycle of user transaction execution. This
figure describes a non-local access to a B-tree. It consists of
three parts: processing by the UTM, transmission of requests
and replies through the shared bus, and processing at
the SRM.

We consider two different workloads to represent dif- 
ferent kinds of B-tree applications: Search-intensive (70% 
Searches and 30% Inserts or 70/30) and Insert-intensive 
(30% Searches and 70% Inserts, or 30/70). The previous 
investigations revealed that it is sufficient to have 25 trans- 
actions for each experiment. We decided to have 100 trans- 
actions from each processor in the system per experiment. 
Each experiment was repeated four times to provide a 95% 
confidence level. 

Specific event timings in the simulator are based on pub- 
lished component speeds in a Hewlett-Packard HP Series 
200 SRM configuration. In order to relate processing and 
data access costs, the processing speeds of each statement 
of user transaction manager’s code are expressed in terms 
of the time required to write one word to disk. This corre- 
lates with other experiments performed with the simulation 
system [7]. 


6. Performance Measures 


In order to compare the performance of System 1 and 
System 2, we have selected the following metrics: cyclic pro- 
cessing power, system throughput, degree of concurrency, 
average waiting time for the bus, and the average waiting 
time for SRM. We have also measured the number of in- 
terferences, the average time for a search transaction, and 
the average time for an insert operation. Due to space
limitations, not all of these measurements are presented in
this paper. Except for cyclic processing power, the metrics
assume their usual meaning and hence are not explained
here. 


6.1. Cyclic Processing Power 


In order to provide an algorithm-independent measure 
of the effectiveness of the concurrent algorithm in a particular
application, we use the concept of cyclic processing 
power (CPP) described by Vrsalovic et al. [9]. Intuitively, 
cyclic processing power measures the percentage of time a
processor actively executes its user program, as opposed to
the time it waits for service from a system resource. 

Let us consider the processing cycle in Figure 3. The 
local processing (in each cycle) that a user request initially 
receives at a processor is denoted by $t_p$. After the initial 
processing, if the initiating processor decides to access a 
non-local shared resource manager, then the corresponding 
user transaction manager attempts to access the shared 
bus. The waiting time to access the shared bus is indicated 
by $t_{wbus}$. The time to completely transmit the request and 
release the bus is indicated by $t_a$. The time for the remote 
processor (or SRM) to execute this request and send the 
reply is indicated by $t_{wsrm}$. As shown in Figure 3, $t_{wsrm}$ 
consists of four components of time: wait time in the SRM’s 
queue prior to processing, time to process the request, wait 
time before accessing a bus to send the reply, and time to 
transmit the reply. The total waiting time per processing 
cycle is defined as $t_w = t_{wbus} + t_{wsrm}$.

In case the initiator decides to execute the request locally,
there is no need to access the bus. Thus the time 
spent by the initiator to access and transmit a request ($t_{wbus}$) 
and the time spent by the SRM to access the bus and transmit
the reply are avoided. 


If we have n processors (or UTMs) in the system, then 
the average cyclic processing power per processor may be 
defined as: 

$$CPP = \frac{1}{n} \sum_{i=1}^{n} \frac{t_{a_i} + t_{p_i}}{t_{a_i} + t_{p_i} + t_{w_i}} \qquad (1)$$

where the subscript i indicates the measurements corresponding
to the processing cycle at the i-th processor. 
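
As an illustration of Equation (1), the following Python sketch averages, over the processors, the ratio of busy time (access plus local processing) to total cycle time (busy time plus waiting). The function name `cyclic_processing_power` and the sample timings are assumptions made for this example only; they are not taken from the simulator or from Tables 1-4.

```python
from typing import Sequence, Tuple


def cyclic_processing_power(cycles: Sequence[Tuple[float, float, float]]) -> float:
    """Average CPP over all processors, following Equation (1).

    Each entry is (t_a, t_p, t_w) for one processor: access time,
    local processing time, and total waiting time (bus plus SRM).
    """
    if not cycles:
        raise ValueError("at least one processor is required")
    total = 0.0
    for t_a, t_p, t_w in cycles:
        busy = t_a + t_p
        total += busy / (busy + t_w)
    return total / len(cycles)


# Hypothetical timings in arbitrary units (not measured values).
example = [(1.0, 3.0, 4.0), (1.0, 3.0, 0.0)]
cpp = cyclic_processing_power(example)
```

With these made-up values the first processor is busy for half of its cycle and the second never waits, so the average is 0.75.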


7. Simulation Results 


Tables 1-4 summarize the measurements of access time 
($t_a$), processing time ($t_p$), waiting time for the shared bus 
($t_{wbus}$), the waiting time for the SRM ($t_{wsrm}$), and the cyclic
processing power (CPP). These timings represent measurements
averaged over all processors and over all the transactions
for a given number of processors (n). CPP is computed
from the other average values using Equation (1). 


1. Degree of Concurrency: As shown in Tables 3 and 
4, processors of both systems show a decrease in de- 
gree of concurrency with an increase in the number 
of processors. From Tables 1-4 it may be observed 
that System 1’s concurrency is reduced primarily due 
to the time spent waiting for the SRM to send the 
results of a command. In System 2, however, concur- 
rency decreases with the addition of processors (and 
their transactions) due to competition for access to 
the shared network. 


2. Throughput: Tables 3 and 4 summarize the throughput
measurements for Systems 1 and 2. As the number
of processors increases, the parallel execution of 
SRM operations boosts the throughput of System 2 
above that of System 1. For System 1, the throughput
decreases with the number of processors. This is 
attributed to the contention at the single SRM. For 
System 2, however, the throughput increases with the 
number of processors. The throughput of System 2 
should keep increasing until enough processors and 
transactions are added to cause high contention on 
the shared network for remote SRM requests. At this 
point the migration of operations to data rather than 
the movement of data to operations may be a good 
approach. 


3. Shared Bus Access: From Table 1 it may be observed 
that $t_{wbus}$ is much higher in System 2 than in
System 1. This is attributed to the increased processing
activity at the processors in System 2 due to 
the distribution of the B-tree. In System 1, the processors
spend most of the time waiting for the SRM’s 
reply, and hence there is less contention on the bus. 
However, $t_{wbus}$ does not increase linearly with 
the number of processors. This is also clear from the 
throughput statistics in Tables 1 and 2. 


4. SRM Access: From Table 1 it is clear that the shared 
resource manager is the bottleneck in System 1. The 
waiting time for a user transaction manager to await 
a reply from the central SRM continues to grow with 
the number of processors (or requestors). This wait- 
ing time also increases with the number of insert 
transactions in the input mix. The possible node 
splits and the resulting increase in read/write actions 
on B-tree nodes can explain this phenomenon. Due 
to the distributed nature of the B-tree processing, the
shared resource manager is no longer the bottleneck 
in System 2. 


5. Cyclic Processing Power: The cyclic processing power 
(in both Systems 1 and 2) decreases with the increase 
in the number of processors. This is due to the idle 
time of the processors, caused either by waiting for the 
reply from the SRM (in System 1) or by contention for 
the bus (in System 2). Overall, the cyclic processing 
power of System 2 appears to be slightly higher than 
that of System 1. By speeding up the shared bus, 
we can reduce the waiting time for bus access, and in 
turn increase the cyclic processing power of
System 2. No such improvements are possible for System 1. 


8. Conclusion 


In this paper, we described the results that we obtained 
during the implementation of a B-link tree system on a mul- 
tiprocessor system with two different architectures. System 
1 implements the B-tree on a single processor with sec- 
ondary memory. System 2 implements a distributed version 
of the B-tree. 

The current investigations have led to several interesting
(though some may be obvious) observations. Without much 
additional hardware or software cost, performance of concurrent
B-link tree operations can be improved dramatically.
System 2 requires additional secondary storage devices
(with divided capacity). This cost is more than balanced
by the increase in performance (throughput and transaction
response time). Since each node in System 2 contains
a copy of SRM and UTM software (the same as in 
System 1), costs for software maintenance should be similar
in both systems. Thus, we conclude that parallel disk 
access and multiprogramming of UTM and SRM functions 
make System 2 far superior. 
Abstract : We study topological properties of multistage interconnection
networks. We state a graph characterization of all the networks 
topologically equivalent to the Baseline networks and we explain why 
networks defined by some types of permutations are equivalent. 
Independent Connections are the link between graph theory and network
definition using a numerical characterization for adjacency relationship.
We establish that Banyan networks built with independent 
connections are topologically equivalent. We also consider the PIPID 
field, a useful set of permutations that allow the construction of the 
usual multistage interconnection networks and are easily 
modeled by independent connections. 


1. Introduction 


Several multistage interconnection networks have been proposed for 
communication in parallel architectures. They are typically designed 
using at least $n = \log_2(N)$ stages of N/2 2x2 switching cells to connect 
N inputs to N outputs [8]. Topological properties of these networks 
have been extensively studied, as only a few parameters (number of 
stages, type and number of cells, connections between stages) may 
drastically change their functionalities. Topological equivalence 
between the "classical" networks (Omega [11], Flip [3], Indirect 
Binary Cube [14], Modified Data Manipulator [6], Baseline and 
Reverse Baseline (see Fig. 1) [7]) has been proved by Wu and Feng [7],
who exhibited one-to-one mappings of the nodes between each 
network and the Baseline network. 


Another approach consists in modeling the networks by graphs or 
directed graphs. Such an approach was considered by Agrawal in [2] 
(see also [1]). He proposed a characterization of this class by "Buddy 
Properties"; unfortunately, the assertion of Theorem 1 of [2] is not 
sufficient to prove equivalence, as has been stated in [5]. 


Kruskal and Snir [10], within the graph theory framework, used a labeling
scheme to describe routing in the network. They defined a network
isomorphism as a graph isomorphism that furthermore 
preserves the vertex labels. They obtained a sufficient condition, 
called the bidelta property, to ensure that a network is isomorphic, in their 
sense, to the classical ones. 


Extending Agrawal’s property, we obtain a graph theoretical characterization
of topologically equivalent networks [4] using connected components
of families of subgraphs, which is unfortunately difficult to 
apply to the definitions of the networks. The aim of this paper is to show the 
relation between our graph characterization and the usual definitions of 
Multistage Interconnection Networks using a set of permutations. 


In section II, we introduce the notations, and we state the characterization
in terms of graph theory. Section III is devoted to the study of 
Independent Connections : our link between graph theory and network 
definition using permutations. In section IV we consider the PIPID 
field, a useful set of permutations which allows the construction of the 
six "classical" networks. PIPID permutations on N symbols are defined 
by a permutation of the index digits of the binary representation of these 
symbols. We show that PIPID permutations used to build Banyan networks
may easily be modeled by independent connections. The main 
result of this paper is to establish that banyan networks built with 
PIPID permutations are topologically equivalent to the Baseline network.
Note that the six networks studied by Wu and Feng [7] are 
designed using a subset of PIPID and that such a design allows an 
efficient bit-directed routing. 


2. A Graph Characterization 


Interconnection networks may easily be modeled by directed graphs 
(digraphs) in which nodes represent the switching cells and arcs the 
communication links. We do not add extra nodes for the inputs and the 
outputs of the network as they do not play any role in the graph iso- 
morphism. 

Let C be a set of nodes; we will denote by $\Gamma^{+}(C)$ the set of children 
of nodes in C, and by $\Gamma^{-}(C)$ the set of parents of nodes in C. 


A multistage interconnection digraph (MI-digraph) with n stages is a 
digraph whose nodes are partitioned into n ordered stages. We denote 
by $V_i$ the nodes of the i-th stage. There are arcs only from nodes of the 
i-th stage to nodes of the (i+1)-th stage (i.e. from $V_i$ to $V_{i+1}$). The nodes 
are of indegree 2 and outdegree 2 except the nodes from the first and 
the last stage. 


With this definition, we say that two multistage interconnection net- 
works are topologically equivalent if and only if their MI-digraphs are 
isomorphic. Two digraphs are isomorphic if and only if there exists a 
bijection from the nodes of the first digraph into the nodes of the 
second digraph, which preserves the relationship of adjacency. 


Fig. 1 : Baseline Network and Baseline MI-digraph 


Remark : In all figures, the arcs are directed from the left to the right. 


Banyan Property Definition : One minimal requirement is to allow a 
connection between any pair of input and output nodes. We say that a 
network has the Banyan Property if and only if for any input and any 
output there exists a unique path connecting them. 


Definition : The connected components of an MI-digraph are those of 
the underlying undirected graph, obtained from the digraph by deleting 
the orientation of the arcs. 


Definition : We denote by $(G)_{i,j}$ the subgraph of G that contains the 
vertices of the stages from i to j : $V_i, V_{i+1}, \ldots, V_j$.


P(i,j) Property Definition : We say that an MI-digraph with n stages 
satisfies the P(i,j) property for $1 \le i \le j \le n$ if and only if the subdigraph 
$(G)_{i,j}$ has exactly $2^{n-1-(j-i)}$ connected components. And we say that 
an MI-digraph satisfies property P(1,*) if and only if it satisfies P(1,j) 
for every j such that $1 \le j \le n$. Similarly it satisfies property P(*,n) if 
and only if it satisfies P(i,n) for every i. 


Using these notations, the next theorem states the weakest condition of 
topological equivalence for multistage interconnection networks. 


Theorem : All the MI-digraphs with n stages satisfying the Banyan 
property, P(*,n) and P(1,*) are isomorphic. Although these properties 
are easy to check, the proof is too long to be included here; it will 
appear in [4]. The proof is done by induction, using the left and right 
recursive construction of the Baseline to design the isomorphism. 


The assumptions of the theorem are very easy to check numerically 
using a breadth-first search algorithm to compute the number of connected
components and the number of nodes at distance k. Unfortunately,
these conditions are not easily related to numerical definitions of 
multistage interconnection networks (i.e. the permutations realized at 
each stage). For instance, the Omega network is defined by n perfect 
shuffles, and it is not obvious why this type of definition 
implies the P(1,*) and P(*,n) topological properties. 
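
The numerical check mentioned above can be sketched in Python as follows. The sketch counts, by breadth-first search, the connected components of the undirected graph underlying the stages i to j, and compares the count with $2^{n-1-(j-i)}$. The helper names (`omega_children`, `count_components`, `satisfies_P`) are hypothetical, and the cell-level connection used for the Omega network (children of cell x are 2x and 2x+1 modulo $2^{n-1}$) is an assumption of this example, obtained by applying a perfect shuffle to the two output links of a cell.

```python
from collections import deque
from typing import Callable, Tuple

Children = Callable[[int, int], Tuple[int, int]]


def omega_children(x: int, n: int) -> Tuple[int, int]:
    """Children of cell x when consecutive stages are linked by a perfect shuffle."""
    mask = (1 << (n - 1)) - 1
    return ((x << 1) & mask, ((x << 1) | 1) & mask)


def count_components(children: Children, n: int, i: int, j: int) -> int:
    """Connected components of the stages i..j (numbered 1..n), ignoring arc direction."""
    width = 1 << (n - 1)
    adj = {(s, x): [] for s in range(i, j + 1) for x in range(width)}
    for s in range(i, j):
        for x in range(width):
            for c in children(x, n):
                adj[(s, x)].append((s + 1, c))
                adj[(s + 1, c)].append((s, x))
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components


def satisfies_P(children: Children, n: int, i: int, j: int) -> bool:
    return count_components(children, n, i, j) == 2 ** (n - 1 - (j - i))


n = 4
p_1_star = all(satisfies_P(omega_children, n, 1, j) for j in range(1, n + 1))
p_star_n = all(satisfies_P(omega_children, n, i, n) for i in range(1, n + 1))
```

The same functions can be reused for any other network by passing a different `children` function; only the per-stage connection changes.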


In the next section, we define independent connections as a pair of 
mappings satisfying numerical constraints. We prove, using the former 
theorem, that banyan graphs built with these connections are iso- 
morphic to the Baseline MI-digraph. Furthermore, we show that the set 
of permutations on N symbols, defined by a permutation of the binary 
digits of the symbol representation, may easily be associated to 
independent connections. Like the perfect shuffle, permutations used to 
design multistage interconnection networks often exhibit this property, 
and the equivalence relationship between "classical" networks becomes 
obvious. 


3. Independent Connection 


As we consider networks defined in terms of permutations on N symbols,
we add a labeling of the nodes in the graph. At each stage, nodes 
are labeled from 0 to $N/2 - 1 = 2^{n-1} - 1$, following the natural order of the 
drawing (Fig. 2). The label of a node is an (n-1)-tuple $(x_{n-1}, \ldots, x_1)$ in 
base 2, so $(x_{n-1}, \ldots, x_1) \in [Z/2Z]^{n-1}$. We consider the usual addition in 
the field $([Z/2Z]^{n-1}, +)$. 


Fig. 2 : Labeling of an MI-digraph (the nodes of each stage are labeled (0,0,0) to (1,1,1) from top to bottom) 


Now, we define connections and independent connections. The major 
result of this section is that banyan networks built with independent 
connections are topologically equivalent. 


Consider an MI-digraph G and its subgraph $(G)_{i,i+1}$. Recall that this 
subgraph is a bipartite graph consisting of two consecutive sets, $V_i$ and $V_{i+1}$, 
of nodes labeled in $[Z/2Z]^{n-1}$, and a set of arcs from $V_i$ to $V_{i+1}$. 

Definition of a Connection : For all $i \ne n$, a connection (f,g) from 
the i-th stage of the MI-digraph G is a pair of functions f and g 
defined on $[Z/2Z]^{n-1}$ such that, if x is a node of the i-th stage of G (i.e. 
$V_i$) then the two children of x in the (i+1)-th stage (i.e. $V_{i+1}$) are f(x) 
and g(x) (i.e. $\Gamma^{+}(x) = \{f(x), g(x)\}$). 
Such a decomposition of the adjacency relationship exists as the outdegree
of a node is always two, except in the last stage. 


Definition of an Independent Connection : a connection (f,g) is 
independent if and only if 
$\forall \alpha \in [Z/2Z]^{n-1}, \ \alpha \ne (0,\ldots,0), \ \exists \beta \in [Z/2Z]^{n-1}$ such that 
$\forall x \in [Z/2Z]^{n-1}$, 
$f(x + \alpha) = \beta + f(x)$ and $g(x + \alpha) = \beta + g(x)$ 
We exhibit in section IV some examples of independent connections. 
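
For small networks, the independence condition can be tested by brute force. The Python sketch below represents a label as a tuple of bits (most significant digit first) and addition in $[Z/2Z]^{n-1}$ as a componentwise exclusive-or. The names `xor_add`, `is_independent`, `shuffle_f` and `shuffle_g` are hypothetical, and the two example connections are illustrative assumptions rather than connections taken from the paper.

```python
from itertools import product


def xor_add(a, b):
    """Componentwise addition in [Z/2Z]^(n-1)."""
    return tuple(p ^ q for p, q in zip(a, b))


def is_independent(f, g, dim):
    """Check f(x+alpha) = beta + f(x) and g(x+alpha) = beta + g(x) for all x, alpha != 0."""
    vectors = list(product((0, 1), repeat=dim))
    zero = vectors[0]
    for alpha in vectors:
        if alpha == zero:
            continue
        # beta is forced by the case x = 0: f(alpha) = beta + f(0)
        beta = xor_add(f(alpha), f(zero))
        for x in vectors:
            shifted = xor_add(x, alpha)
            if f(shifted) != xor_add(beta, f(x)):
                return False
            if g(shifted) != xor_add(beta, g(x)):
                return False
    return True


# Shift the digits left and append 0 (f) or 1 (g): a shuffle-like connection.
def shuffle_f(x):
    return x[1:] + (0,)


def shuffle_g(x):
    return x[1:] + (1,)


shuffle_ok = is_independent(shuffle_f, shuffle_g, 3)

# A connection with a non-linear digit (logical AND) is not independent.
nonlinear_ok = is_independent(lambda x: x, lambda x: (x[0] & x[1], x[1]), 2)
```

Because the same β must work for both f and g, it is enough to derive β from f at x = 0 and then verify it everywhere else.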


Proposition 1 : A banyan graph built with independent connections 
satisfies the buddy property [2] (i.e. the interconnection pattern between 
nodes of consecutive stages is the $K_{2,2}$ graph). 


Proof : Let x and y be two nodes in $V_i$, such that x and y share one 
neighbour in $V_{i+1}$. As the connection (f,g) between $V_i$ and $V_{i+1}$ is 
independent we have 

$h_1(x) = h_2(y)$ 

where $h_1$ and $h_2$ are either f or g. Indeed x and y share one neighbour,
but we do not know if this child is obtained by function f or g. 
Let us denote by α the difference between y and x, then there exists β 
such that 

$h_1(x + \alpha) = h_1(y) = h_1(x) + \beta$ 
$h_2(x + \alpha) = h_2(y) = h_2(x) + \beta$ 

Then $h_2(x) = h_2(y) - \beta = h_1(x) - \beta = h_1(y)$, since $-\beta = \beta$ in $[Z/2Z]^{n-1}$. 
Hence x and y share two neighbours in row $V_{i+1}$, and the MI-digraph 
satisfies the Buddy Property. Following Agrawal in [2], we define x 
and y as buddy nodes. Note that f(x) and f(y) are buddy nodes too. 
Indeed buddy nodes share two children or two parents. 


We now give a definition and a technical lemma that help us to prove 
one of the assumptions of theorem 1 : the P(1,*) property. 


Definition of a translated set : Let A be a subset of $V_i$, and v a vector
in $[Z/2Z]^{n-1}$; we call the v-translated set of A the set of nodes 
$\{a_i + v\}$ when $a_i$ takes all values in A. 


Lemma 2 : Consider an independent connection (f,g). Let A be a 
subset of $V_i$ such that the number of nodes in both A and $\Gamma^{-}(A)$ is $2^k$. 
Let v be an arbitrary vector of $[Z/2Z]^{n-1}$. If B is a v-translated set of 
A, then $\Gamma^{-}(B)$ has $2^k$ nodes and is a translated set of $\Gamma^{-}(A)$. 


Proof : 

As A and $\Gamma^{-}(A)$ have the same number of nodes, all nodes of A are 
buddy nodes. Similarly all nodes in $\Gamma^{-}(A)$ are buddy nodes. Let $a^1$ 
and $a^2$ be two buddy nodes in A, and $b^1$ and $b^2$ their v-translated 
nodes. We first show that $b^1$ and $b^2$ are buddy nodes too. We have, 

$f(b^i) = f(a^i + v) = f(a^i) + w$ for i = 1,2 
$g(b^i) = g(a^i + v) = g(a^i) + w$ for i = 1,2 
as $a^1$ and $a^2$ are buddy, we have 
$h_1(a^1) = h_2(a^2)$ 
where both $h_1$ and $h_2$ are either function f or function g. Therefore : 
$h_1(b^1) = h_1(a^1) + w = h_2(a^2) + w = h_2(b^2)$ 
Hence $b^1$ and $b^2$ are buddy nodes. 


Now, we complete the proof by showing that $\Gamma^{-}(B)$ is a translated set 
of $\Gamma^{-}(A)$. Let x be a node in A, let y be the v-translated of x, let w 
(resp. z) be a parent of x (resp. y). We have : 

$h_1(z) + v = h_2(w)$ 
where both $h_1$ and $h_2$ are function f or g. 

Then, let u be an arbitrary point of $\Gamma^{-}(B)$, and α = u - w. We prove 
that node u + z - w is in $\Gamma^{-}(A)$. 

$h_2(u) = h_2(w + \alpha) = h_2(w) + \beta(\alpha) = h_1(z) + v + \beta(\alpha)$ 
According to the definition of an independent connection we have : 
$h_1(z + \alpha) = \beta(\alpha) + h_1(z)$ 

Fig. 3 : Construction of the sets 

Therefore, 
$h_1(z + \alpha) = h_2(u) - v$ 

For any node u in $\Gamma^{-}(B)$, nodes $\{\Gamma^{+}(u) - v\}$ are in A. Therefore, 
node z + α is a node in $\Gamma^{-}(A)$ and node u is a (w-z)-translated node in 
$\Gamma^{-}(A)$. This is true for any node u, so $\Gamma^{-}(B)$ is a (w-z)-translated 
set of $\Gamma^{-}(A)$. 
□ 


Lemma 3 : A banyan MI-digraph built with independent connections 
satisfies the P(1,*) property. 


Proof : the proof proceeds by induction. 


• Proposition 1 proves that such an MI-digraph satisfies property 
P(1,2). Indeed, the Buddy Property and the P(1,2) property are 
equivalent. We prove in the following that P(1,j) implies 
P(1,j+1) under the assumptions of Lemma 3. 


• Let x be a node of $V_j$ and K be the connected component of 
$(G)_{1,j}$ containing x. Let Z be the set of children of x in $V_{j+1}$, 
and $A_i$ be the intersection of K and $V_i$, for all i, $1 \le i \le j$. 
According to the induction hypothesis, the number of nodes in $A_i$ 
is $2^{j-1}$. As the graph is banyan, each node of Z has only one 
parent in $A_j$ and each node of $A_j$ has two children in Z. 


Fig. 4 : Construction of G 


Let $B_j$ be the set of buddy nodes of $A_j$. We will prove that $B_j$ is 
a translated set of $A_j$. Let $a^1$ and $b^1$ be two arbitrary buddy 
nodes in $A_j \times B_j$ and let y be an arbitrary node in $A_j$. 

$\alpha = y - a^1$ 
$f(a^1 + \alpha) = f(a^1) + \beta$ 
$g(a^1 + \alpha) = g(a^1) + \beta$ 

As $a^1$ and $b^1$ are buddy nodes, we have 

$h_1(a^1) = h_2(b^1)$ 
where both $h_1$ and $h_2$ are function f or g. Furthermore 
$h_2(b^1 + \alpha) = h_2(b^1) + \beta = h_1(a^1 + \alpha)$ 

Hence $b^1 + \alpha$ and $a^1 + \alpha$ are buddy nodes, and $B_j$ is a $(b^1 - a^1)$-
translated set of $A_j$. 


Then, we apply Lemma 2 twice on $A_j$ and $B_j$. Indeed, according
to the buddy property, the sets $A_j$ and $A_{j-1}$ have the same 
number of nodes, and according to the banyan property we have : 

$\Gamma^{-}(A_j) = A_{j-1}$ 

As we denote by $B_{j-1}$ the set $\Gamma^{-}(B_j)$, Lemma 2 implies that this 
set is a translated set of $A_{j-1}$ and has the same number of nodes 
as $A_{j-1}$ (i.e. $2^{j-1}$). We define $B_k$ as the set $\Gamma^{-}(B_{k+1})$. We can 
now apply Lemma 2 on $A_{j-1}$ and $B_{j-1}$ to show that the set $B_{j-2}$ 
has the right number of nodes. By induction we prove this property
on every set $B_k$. 


• As the connected component of $(G)_{1,j+1}$ containing Z is exactly 
Z and K and the union of the $B_k$ sets, we have shown that this 
component of $(G)_{1,j+1}$ has $2^{j}$ nodes per stage. Hence the graph G 
satisfies the P(1,j+1) property. 
□ 


Similarly we prove by induction the following lemma : 


Lemma 4: A banyan MI-digraph G built with independent connec- 
tions satisfies the P(*,n) property. 

Indeed Proposition 1 proves that G satisfies property P(n-1,n). We prove 
that the property P(j,n) implies P(j-1,n) under the assumption of the 
Lemma. We apply the same technique as in Lemma 3: we decompose
a connected component of $(G)_{j-1,n}$ into a connected component of 
$(G)_{j,n}$ called K, the parents of K in $V_{j-1}$ called Z, and the other children
of Z in $V_j$. Then, we prove that at each stage, this set of children 
has the right number of nodes. The proof is too long to be included 
here. 


According to our graph characterization, we can state the announced 
result, whose corollaries are developed in the next section. We consider 
in the following section some connections defined by a permutation of 
the digital representation (i.e. the tuple $(x_{n-1}, \ldots, x_1)$) and we show the 
relations between these connections and the PIPID set of permutations. 
Fortunately enough, these connections are independent connections, 
allowing us to use our main theorem : 


Theorem 5 : A banyan MI-digraph built with independent connections 
is topologically equivalent to the Baseline MI-digraph. 
□ 


4. PIPID Permutations 


Consider now a labeling of the links of the network at the inputs and 
outputs of all cells following the natural order of the drawing. A label 
is a number between 0 and N-1 whose binary representation is denoted 
by $(x_{n-1}, \ldots, x_1, x_0)$. Each link is defined by two labels and each stage is 
defined by a permutation of these N labels. 


Multistage interconnection networks have often been defined using 
these permutations and functional properties have been derived from 
this model [13]. For instance, the Omega network is defined as n 
stages of perfect shuffle. A perfect shuffle $\sigma$ may be defined as a circular
left shift of the binary representation of the operand. Similarly, the 
k-subshuffle $\sigma_k$, the k-butterfly $\beta_k$, and the bit reversal $\rho$ are easily 
defined by permutations on the bits of the number representation (see 
[9] for more definitions). These permutations have been used to design 
the six networks studied by Wu and Feng and one may ask if this 
scheme of construction is the reason for the topological 
equivalence of the networks. 


Consider numbers from 0 to N-1 and their binary representation 
$(x_{n-1}, \ldots, x_1, x_0)$. Following Lenfant [12], we define PIPID permutations 
on these numbers by a permutation on the index of the representation. 
$\lambda \in PIPID(N = 2^n) \iff \exists \theta$ permutation on n symbols such that 
$\lambda(x_{n-1}, \ldots, x_1, x_0) = (x_{\theta(n-1)}, \ldots, x_{\theta(1)}, x_{\theta(0)})$ 


Perfect shuffle, bit reversal and butterfly are examples of Permutations 
Induced by a Permutation on the Index Digits (PIPID). We prove in the 
following that these permutations are also associated to a family of 
very simple independent connections. 
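
As a small illustration of the PIPID definition, the following Python sketch builds the permutation on the numbers 0 to N-1 induced by a permutation of the index digits, with digit 0 taken as the least significant bit. The function name `pipid` and the list encoding of the index permutation (entry i gives the source digit of output digit i) are assumptions of this example.

```python
def pipid(theta, n):
    """Permutation on 0..2^n - 1 induced by the index permutation theta.

    Output digit i is input digit theta[i]; digit 0 is the least significant bit.
    """
    if sorted(theta) != list(range(n)):
        raise ValueError("theta must be a permutation of 0..n-1")

    def apply(x):
        y = 0
        for i in range(n):
            y |= ((x >> theta[i]) & 1) << i
        return y

    return apply


n = 3
# Perfect shuffle: circular left shift, so output digit i is input digit (i-1) mod n.
shuffle = pipid([(i - 1) % n for i in range(n)], n)
# Bit reversal: output digit i is input digit n-1-i.
bit_reversal = pipid([n - 1 - i for i in range(n)], n)

shuffle_table = [shuffle(x) for x in range(1 << n)]
reversal_table = [bit_reversal(x) for x in range(1 << n)]
```

For n = 3 the shuffle table is [0, 2, 4, 6, 1, 3, 5, 7], the familiar interleaving of the two halves of the label range.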


Compare now the label of a node or cell used in section III and the 
labels of the links connected to the outputs of this cell as stated at the 
beginning of this section. One can readily observe that the first n-1 
bits of a link label are exactly the binary representation of the incident 
node label. 


Let $\lambda$ be an arbitrary permutation of PIPID used to design a stage of a 
network, and let $\theta$ be the associated permutation of the index. Let x be 
a node or cell label, $x = (x_{n-1}, \ldots, x_1)$. The links connected to this cell 
are labeled : 
$y^0 = (x_{n-1}, \ldots, x_1, 0)$ 
$y^1 = (x_{n-1}, \ldots, x_1, 1)$ 
Applying permutation $\lambda$ on these two labels gives the two labels of the 
links $(z^0, z^1)$ in the next stage. 
$z^0 = \lambda(y^0)$ 
$z^1 = \lambda(y^1)$ 
Let $k = \theta^{-1}(0)$ and $m = \theta(0)$. We have 
$z^0 = (x_{\theta(n-1)}, \ldots, x_{\theta(k+1)}, 0, x_{\theta(k-1)}, \ldots, x_{\theta(1)}, x_{\theta(0)})$ 
$z^1 = (x_{\theta(n-1)}, \ldots, x_{\theta(k+1)}, 1, x_{\theta(k-1)}, \ldots, x_{\theta(1)}, x_{\theta(0)})$ 
And, if we consider only the first (n-1) digits, we obtain the labels of 
the cells connected to cell x : 
$(x_{\theta(n-1)}, \ldots, x_{\theta(k+1)}, 0, x_{\theta(k-1)}, \ldots, x_{\theta(1)})$ 
$(x_{\theta(n-1)}, \ldots, x_{\theta(k+1)}, 1, x_{\theta(k-1)}, \ldots, x_{\theta(1)})$ 
Now, we have to identify the two mappings f and g and to check that 
the connection (f,g) satisfies the independence property. 


Note that we have supposed in the former equations that k is not zero. 
Indeed, we can disregard this particular case as such permutations are 
not useful to build banyan networks. If k is zero, then there are two 
links between the cells, and the graph obviously does not satisfy the 
banyan property. 

Let us suppose that $k \ne 0$ and let $\varphi$ denote the following permutation 
on [1..n-1]. 


$\forall i \ne k, \ \varphi(i) = \theta(i)$ 
$\varphi(k) = m$ 
To compute the labels of the cells connected to cell x, one just has to 
apply the permutation $\varphi$ on the index of the binary representation of x 
and force to 0 or 1 the k-th bit. We suggest using the following two 
functions f and g to design an independent connection which realizes 
this operation. We define the two functions by their projections $f_i$ and 
$g_i$, for all i, $1 \le i \le n-1$. 
$f_i(x) = x_{\varphi(i)} \quad \forall \varphi(i) \ne m$ 
$g_i(x) = x_{\varphi(i)} \quad \forall \varphi(i) \ne m$ 
$f_k(x) = 1 + x_m$ 
$g_k(x) = x_m$ 
Therefore, the connection (f,g) is independent. Let $\alpha$ be an arbitrary 
non zero vector; we obtain the vector $\beta$ by applying the permutation $\varphi$ 
on the index of the binary representation of $\alpha$. 
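
The construction above can be checked numerically on a small example. The Python sketch below derives a pair (f, g) from an index permutation and verifies two things: that {f(x), g(x)} equals the pair of cells obtained by permuting the link labels directly, and that the independence condition holds with β obtained by permuting the digits of α. It assumes the convention that output digit i of the PIPID permutation is input digit θ(i), and that the permutation on [1..n-1] sends k to m and agrees with θ elsewhere. All function names are hypothetical.

```python
def connection_from_pipid(theta, n):
    """Build (f, g) on cell labels (integers on n-1 bits) from an index permutation."""
    k = theta.index(0)  # k = theta^{-1}(0): position receiving the link digit
    m = theta[0]        # m = theta(0): cell digit moved to position 0 and dropped
    if k == 0:
        raise ValueError("k = 0 gives a double link, which cannot be banyan")
    phi = {i: (m if i == k else theta[i]) for i in range(1, n)}

    def digit(x, i):
        """Digit x_i of a cell label, for i in 1..n-1."""
        return (x >> (i - 1)) & 1

    def g(x):
        return sum(digit(x, phi[i]) << (i - 1) for i in range(1, n))

    def f(x):
        return g(x) ^ (1 << (k - 1))  # same digits as g, with digit k complemented

    return f, g


def direct_children(theta, n, x):
    """Cells reached by permuting the two link labels (x,0) and (x,1)."""
    def permute(y):
        return sum(((y >> theta[i]) & 1) << i for i in range(n))
    return {permute((x << 1) | b) >> 1 for b in (0, 1)}


n = 4
theta = [(i - 1) % n for i in range(n)]  # perfect shuffle on n digits
f, g = connection_from_pipid(theta, n)
cells = range(1 << (n - 1))

same_children = all({f(x), g(x)} == direct_children(theta, n, x) for x in cells)
# beta is the digit-permuted alpha, which is exactly g(alpha) since g only permutes digits.
independent = all(
    f(x ^ a) == f(x) ^ g(a) and g(x ^ a) == g(x) ^ g(a)
    for a in cells if a != 0
    for x in cells
)
```

Replacing `theta` by another index permutation with k different from 0, such as the bit reversal, exercises the same two checks.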
So, we can associate independent connections with the PIPID permutations
used to build banyan networks. We now have an easy-to-check 
sufficient condition of equivalence with the Baseline network : all 
banyan multistage networks built with PIPID permutations are topologically
equivalent to the Baseline network. As Omega, Baseline, Reverse 
Baseline, Flip, Indirect Binary Cube and Modified Data Manipulator 
networks are designed using PIPID permutations, they are topologically 
equivalent. 


5. Conclusion 


We stated a characterization of Baseline equivalent networks using a 
graph model of multistage interconnection networks. As this characterization
is difficult to apply to networks defined by permutations, we 
designed a new tool, independent connections. The independence property
is a numerical constraint on the adjacency relationship. We 
derived from this constraint a characterization of permutations (PIPID) 
which can be used to build equivalent networks. As these permutations 
are associated with a very simple bit-directed routing, they have been 
used to design most of the multistage interconnection networks 
presented in the literature. Note that the results obtained here apply 
only to networks built with 2x2 switching cells, whereas our graph 
characterization has been generalized to arbitrary cell sizes. 
Finally, we hope that this approach will be useful to study other topological
or functional properties of multistage interconnection networks. 
NONUNIFORM TRAFFIC SPOTS (NUTS) IN MULTISTAGE INTERCONNECTION NETWORKS 
Abstract 


The performance of multistage interconnection networks 
with blocking switches is degraded when the traffic pattern pro- 
duces nonuniform congestion in the switches, that is, when there 
exist nonuniform traffic spots (NUTS). For some specific patterns 
we evaluate this degradation in performance and propose modifi- 
cations to the network organization and operation to reduce the de- 
gradation. Successful modifications are the use of diverting 
Switches and the extension of the network to include alternate 
paths. The use of these modifications to the basic blocking policy 
for control of contention makes the network more effective for a 
larger variety of traffic patterns. 


1. Introduction 


Multistage interconnection networks (MIN) are used in 
multiprocessor systems to connect processors with other proces- 
sors or with memory modules. These networks provide a 
compromise between networks of low latency and high cost, such 
as the crossbar, and networks of high latency and low cost, such as 
the shared bus. Moreover, MINs can be pipelined to provide a 
bandwidth comparable to that of the crossbar for suitable traffic 
patterns. In addition, the control of routing is simple. A large body 
of work has been done on the structure, operation, and perfor- 
mance of these networks; a comprehensive reference is [1]. These 
networks were initially introduced for use in array computers of 
the SIMD type; in this context the interconnection networks are 
sometimes called permutation networks. More recently, they are 
being proposed and used in multiprocessors of the MIMD type, 
especially of the shared-memory variety [2]. In this paper we are 
concerned with this second type of use. 


MINs, in their basic form, provide a unique path between 
any source-destination pair. However, the paths for different pairs 
are not disjoint and, therefore, conflicts might occur when simul- 
taneous communication is established between several source- 
destination pairs. The basic method used to handle this problem is 
to use a packet-switched type of operation and to buffer the pack- 
ets in the switches. Blocking occurs whenever the buffers become 
full. 


It has been shown that the performance of these networks 
is satisfactory for uniform traffic [3]. More recently, several stu- 
dies [4] have indicated that the performance of the network is de- 
graded significantly when the traffic includes hot-spot traffic, that 
is, when each source generates a larger fraction of the traffic to one 
particular destination. This type of traffic occurs because of access 
to shared variables, such as semaphores. To overcome this degra- 
dation, a network with combining switches has been proposed. 


The topic of this paper is a more general type of nonuniform
traffic, in which there is no concentration of the traffic to one
destination, but the traffic is not uniformly distributed among the
switches, producing nonuniform traffic spots (NUTS). We illustrate
some typical cases of this type of traffic and show the degradation
in network performance that they produce. We then explore
solutions to reduce this performance degradation.


Of course, in this case the use of combining switches is not
a solution, since the contending packets do not necessarily have the
same destination. We show that randomization of the traffic,
proposed for reducing contention in multicomputers [5], is not
suitable either. As positive alternatives to improve the performance,
we consider the use of diverting switches, with several diverting
policies, and networks with alternate paths. Because they reduce the
degradation, the proposed modifications to the basic network with
blocking make the multistage network suitable for a larger variety
of multiprocessor applications.


The performance of the proposed solutions is evaluated by 
simulation. The objective of this evaluation is to show that, under 
reasonable conditions, performance of the original network with 
blocking switches is badly degraded by the presence of NUTS and 
that the modifications proposed significantly reduce this degrada- 
tion. On the other hand, it is not our objective to give an extensive 
set of graphs from which the performance of particular networks 
with specific traffic patterns can be determined. Consequently, we 
select a set of reasonable network parameters and traffic patterns 
and use these for the simulation. More detail can be found in [6]. 


2. Multistage Network Structure and Operation 

We now give a brief description of the structure and operation
of the multistage network, emphasising the assumptions we
make. A more detailed discussion can be found in [1]. The type of
multistage interconnection network we are considering has N = 2^n
inputs (sources) and outputs (destinations). It consists of n stages
of N/2 2x2 switches, as shown in Figure 1. The outputs of stage-i
switches are connected to the inputs of stage-(i-1) switches, with
the network inputs going to stage-(n-1) switches and the network
outputs coming from the stage-0 switches.

Figure 1. An 8x8 Omega Network


Several specific multistage networks have been proposed, 
differing in the interconnection pattern between stages. Since the 
characteristics, in terms of type of operation and performance, are 
similar for all these different topologies, we consider here the 
Omega network [7], which has been extensively studied [8] and is 
being used in several multiprocessor systems. 


The routing of packets in the network is unique, since there
is a single path from a specific source to a specific destination. The
routing is controlled by a destination tag that is carried, together
with the message, as part of each packet.
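
As an illustration of destination-tag routing, the following minimal sketch traces one packet through an Omega network. It assumes the usual Omega wiring (a perfect shuffle in front of every stage) and that the tag bits are consumed most significant bit first, which matches the stage numbering from n-1 down to 0 used here. The function name omega_route is hypothetical and is not taken from the simulator used in this paper.

```python
def omega_route(source, dest, n):
    """Return the line position of a packet after each of the n stages.

    Assumption: each stage is preceded by a perfect shuffle, and the
    switch output (0 = upper, 1 = lower) is chosen by one bit of the
    destination tag, most significant bit first.
    """
    mask = (1 << n) - 1
    pos = source
    path = []
    for k in range(n):
        # Perfect shuffle: rotate the n-bit line number left by one.
        pos = ((pos << 1) | (pos >> (n - 1))) & mask
        # Exchange: the tag bit for this stage selects the switch output.
        bit = (dest >> (n - 1 - k)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


# Source 5 to destination 3 in an 8x8 network (n = 3).
print(omega_route(5, 3, 3))
```

Because every stage overwrites one bit of the position with one bit of the tag, the packet ends on line dest regardless of its source, and the path is fully determined by the (source, destination) pair.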


Since each output can send only one message per cycle
(the network is synchronous and pipelined), there is a conflict when
both packets entering a switch in a cycle have to be routed to the
same output. One solution to this conflict is to have a buffer for
each output and to store the additional packet in that buffer. Of
course, these buffers are finite, so an operating policy is needed
for the case in which a buffer is full. The basic scheme used is a
blocking policy, in which the predecessor switches do not send
packets to a full buffer. To support this policy it is necessary to
have signals from a switch to its predecessors indicating that the
corresponding buffer(s) is full (Figure 2). Note that, since both
predecessors can send messages to the same buffer, a policy is also
needed for the case in which there is just one free space in the
buffer. In such a case, we alternate the predecessor that is blocked.


[Diagram: a 2x2 switch with inputs 0 and 1 and outputs 0 and 1. Buffer-full signals go to the switches feeding inputs 0 and 1 and arrive from the switches fed by outputs 0 and 1.]

Figure 2. Blocking Switch With FULL Control Signals


In this paper we do not evaluate the different buffering or- 
ganizations and policies. The degradation produced by NUTS is 
inherent to the blocking operation of the network, which is present 
for any of the buffer organizations and policies. Consequently, we 
perform our analysis using output buffers with FIFO policy. but 
the 


When processors send request packets to remote memory 
modules, traffic in the opposite direction is also generated. These 
return packets must traverse an analogous network to reach the 
processors. The analysis of this type of traffic is similar to that
of the request traffic, and is not considered here.


3. Performance evaluation by simulation 
We now describe the measures that we will use to evaluate 
the performance of the network. We also indicate the types of traff- 
ic and network parameters considered. As discussed in the intro- 
duction, we select a reasonable set of parameters and perform 
simulations to compare the performance for the original network 
and for the modifications proposed. 


Of importance in our study are the different traffic patterns
used, since the degradation due to nonuniform traffic spots
(NUTS) and the applicable solutions depend on the traffic patterns
considered. In the next section we present the patterns used.


In addition to the traffic pattern, the traffic load is of im- 
portance. We distinguish two types of systems: open and closed 
systems. In an open system, each processor generates a packet
every r cycles, so that the load is specified by the fraction 1/r. In a
closed system, on the other hand, the load is defined by the maximum
number of outstanding packets. We have found that the
results are qualitatively similar for both cases for the same total
throughput. Consequently, to concentrate on significant parameters,
we only report on results for open systems.


We evaluate the steady-state behavior of the system, that 
is, we assume that the traffic pattern under consideration remains 
for a period long enough to achieve this steady state. 


The fundamental parameters for the network are its size 
and the size of the queues. We have found that the relative perfor- 
mance of the network remains essentially the same for different 
values of these parameters. Consequently, we report our results for 
a network of size 64 and queues of size 2, which are also convenient
because they produce a relatively small delay, except for
the blocking case where, because of the way the full signal is
generated, this size of queue is not adequate. In this latter case, we use
a queue of size 4.


The main performance measures of interest are the
throughput of the network in packets/cycle, and the average delay
of the packets. The maximum throughput is N packets per
cycle and the minimum delay is n cycles. This performance is
obtained when there are no conflicts, that is, when in all cycles all
switches receive two packets and route one to each output. For other
cases, the performance is shown by the function delay vs.
throughput.


In a multiprocessor system all processors cooperate in the 
execution of a task and have to synchronize periodically. Conse- 
quently, it is convenient for all processors to advance at a uniform 
pace, so that processors do not have to wait unnecessarily for 
slower processors. The measure we use to evaluate the relative ad- 
vance of the processors is the distribution of throughput. 


For the simulations we built a network simulator using as a 
basis SIMON, a general-purpose multiprocessor simulator 
developed at the University of Utah [9]. 


Several studies have been reported on the performance of 
multistage interconnection networks with uniform traffic [3]. The 
results of our simulations for uniform traffic confirm what previ- 
ous studies have indicated. 


4. Traffic patterns producing NUTS 

The evaluation studies that have been made for the "hot
spot" problem point to a more general situation with nonuniform
traffic. The same type of degradation should occur whenever the
traffic is such that one or more switches carry a larger fraction of
the total traffic than their share. This degradation is due to the same
"tree saturation" effect observed in the hot-spot case. In the context
of the Omega network, switch i of stage j carries the traffic
going from a specific subset of 2/ sources to a specific subset of 


destinations (Figure 1). Consequently, switch congestion occurs 
whenever this traffic is excessive. This can occur even in situations 
in which the fraction of traffic going to each destination is the 
same. The main objective of this research is to identify the traffic 
patterns that produce non-uniform traffic spots (NUTS), to evalu- 
ate the degradation in performance, and to propose and evaluate 
solutions to this problem. 


To study the influence of NUTS on performance we have 
considered two types of traffic as follows. These types are just ex- 
amples to illustrate the problem and evaluate the solutions; they 
correspond to situations that could occur, but are not specific prac- 
tical patterns. 


Traffic of Type I. 


In the first type, each source issues all its requests to one 
destination and no pair of sources sends to the same destination. In 
the shared memory case, this type of pattern models a system in 
which each processor has a preferred memory module that contains 
both the code and the data for that processor. It might be argued 
that in such a case it would be better to assign to each processor a 
local memory module with direct access without going through the 
network (this is the scheme used, for example, in the BBN Butterf- 
ly). However, the use of the network to have a uniform access time 
from any processor to any memory module, permits a flexible 


dynamic scheduling approach that is not possible in the local 
scheme. In this dynamic scheduling model, a processor can have 
its code/data in any memory module, and this module can vary 
with time. This type of traffic would model also situations in 
which the communication is among pairs of processors. 


Some specific instances of this type of traffic pattern produce
significant NUTS while others do not. We use two different
instances for our study. In instance 1, we use a bit-reversal
permutation, which is known to produce a large contention in the
Omega network, as is evident from the switch positions shown in
Figure 3. This is an extreme case; it shows a lower bound on the
improvement that can be achieved with the techniques used. To
model a more typical situation, as instance 2 we generated an
arbitrary permutation.
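
The contention caused by a permutation can be estimated without simulation by counting how many source-destination paths share each switch output link. The sketch below does this under the assumption of the usual Omega wiring (a perfect shuffle before each stage, tag bits used most significant bit first), in which the line occupied after k stages consists of the low n-k bits of the source followed by the top k bits of the destination. The names bit_reverse and max_link_load are hypothetical helpers, not part of the simulator used here.

```python
from collections import Counter


def bit_reverse(x, n):
    """Reverse the n-bit binary representation of x."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def max_link_load(perm, n):
    """Largest number of paths that share one switch output link.

    perm[s] is the destination of source s. Assumption: after k stages a
    packet is on the line whose number is the low n-k bits of the source
    followed by the top k bits of the destination.
    """
    size = 1 << n
    worst = 0
    for k in range(1, n + 1):
        loads = Counter(
            ((s << k) | (perm[s] >> (n - k))) & (size - 1) for s in range(size)
        )
        worst = max(worst, max(loads.values()))
    return worst


n = 6  # 64 inputs, the network size used in the simulations
reversal = [bit_reverse(s, n) for s in range(1 << n)]
identity = list(range(1 << n))
print(max_link_load(reversal, n), max_link_load(identity, n))
```

For n = 6 the bit-reversal permutation places 8 paths on the most heavily used links, while the identity permutation never places more than one path on a link. Since a link carries at most one packet per cycle, a load of 8 bounds the sustainable rate of the sources sharing that link well below the conflict-free maximum.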


Figure 3. Switch Positions for the Bit Reversal Permutation


The throughput-delay for these patterns is shown in Figure 
4. As can be seen, there is significant degradation in performance, 
as compared with the uniform traffic case. 

Figures 4 and 5. Simulation Results for Blocking Switches


Moreover, in case of the arbitrary permutation there is a 
large variation between the throughput of the different processors 
(Figure 5). As mentioned before, this is not desirable when the 
processors are cooperating in a single task. 


Traffic of Type II. 


The second traffic pattern we consider consists of requests 
going from even numbered sources to destinations in the first half 
and from odd numbered sources to destinations in the second half 
(EFOS). This pattern serves to illustrate a case in which each 
source accesses a subset of destinations. In this case, there are also 
NUTS. The performance of the net is shown in Figure 4. 


We conclude from these simulations that the performance 
of the network is badly degraded by the NUTS, with respect to the 
performance for uniform traffic. We now explore ways to reduce 
this degradation. 


5. Unsuccessful solutions: randomization and discarding 
We now report on randomization and discarding, two ap- 
proaches to reduce the degradation due to NUTS which turned out 
to be unsuccessful. 


Randomization 


As a first solution to the degradation due to NUTS, we 
consider the use of randomization of the traffic. In this approach, 
proposed previously to handle load imbalances in routing of multi- 
computers [5, 10], packets are first sent to random destinations and 
then rerouted to their final destinations. This scheme has the effect 
of making the traffic pattern uniform and, therefore, of eliminating 
the added congestion of nonuniform traffic. 

Figure 6. Randomization


In the context of multistage networks, the use of this 
scheme implies that all messages make two passes through the net- 
work. This has the two negative effects of doubling the minimum 
delay and reducing the effective throughput to half, because of the 
additional traffic through the net produced by rerouting. The 
results of simulations for the two types of traffic described in the 
previous section are shown in Figure 6, which exhibits the expect- 
ed throughput and delay. As seen from there, randomization pro- 
duces a relatively small improvement for the extreme bit-reversal 
case, while it is detrimental for the others. 


Discarding Switch 

Another solution we considered was the use of discarding 
switches. Switches of this type resolve congestion by discarding 
overflow packets. The original source of the packet is made aware 
of the status of the packet, through an explicit signal from the 
switch or a timeout mechanism, and retransmits it. Note that this 
requires the source to buffer all outstanding packets until an ack- 
nowledgement is received from the destination. The switches also 
require the ability to signal the appropriate source that a particular 
packet was discarded. This requires additional interconnect and 
more complex control. This type of switch is used in the Butterfly 
Parallel Processor to deal with contention in the network and avoid 
"tree saturation" [11].


The simulations show that, for the traffic patterns considered,
there is no improvement with respect to the network with
blocking switches. This can be explained by the fact that the
discarded traffic is reissued by the same processor as the first time and,
therefore, follows the same path leading to the NUTS.


6. Diverting Switch 

In a diverting switch the messages in front of the buffers 
are always sent to the successors, irrespective of whether there is 
space for them in the corresponding destination buffers. If both 
messages that arrive at a switch go to the same output buffer and
there is no space for both, then one of the messages is diverted to 
the other buffer of the switch (Figure 7). Note that there is always 
at least space for one message in each buffer since one message 
departs from each buffer in every cycle. Of course, the diverted 
message will go to a wrong destination (since there is just one path 
in the network for each source/destination pair); therefore, the 
message will have to be resent into the network to the correct des- 
tination. Consequently, this mode of operation requires a connec- 
tion between each network output and corresponding input (a 
wrapped-around organization). 


Figure 7. Behavior of the Diverting Switch 


Diversion has potentially a better performance than dis- 
carding because the packets are rerouted from a source that is dif- 
ferent from the original source. This makes it possible for the mes- 
sage to avoid the NUTS in the second pass. 


Since to obtain a good performance it is convenient to 
reduce the number of packets that are diverted, whenever a conflict 
occurs and one of the packets in the conflict has already been 
diverted (in that pass through the network) we give preference to 
the nondiverted packet (to go to the correct destination). 


Because of the diversions, a message might traverse the 
network several times before getting to its destination. A possible 
problem with this form of operation is that it is not possible to as- 
sure that a particular message will have a bounded delay. To avoid 
this and make the delay more uniform, we give preference to older 
packets. 
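
The decision taken by one diverting switch in one cycle can be sketched as follows. This is a minimal illustration, not the simulator code: the Packet fields and the function divert_switch are hypothetical, and the way the two preferences are combined (nondiverted packets first, then older packets) is an assumption. The caller is assumed to guarantee at least one free slot in each output buffer, as argued above.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    want: int               # output (0 or 1) requested at this stage
    age: int                # cycles since the packet was first issued
    diverted: bool = False  # already diverted during this pass?


def divert_switch(arrivals, free):
    """Assign up to two arriving packets to the two output buffers.

    free[q] is the number of free slots in buffer q (each at least 1).
    Returns (packet, buffer) pairs in priority order.
    Assumed priority: nondiverted packets first, then older packets.
    """
    free = list(free)
    order = sorted(arrivals, key=lambda p: (p.diverted, -p.age))
    placed = []
    for p in order:
        # Use the requested buffer if it has room, else divert to the other.
        q = p.want if free[p.want] > 0 else 1 - p.want
        if q != p.want:
            p.diverted = True
        free[q] -= 1
        placed.append((p, q))
    return placed


old, young = Packet(want=0, age=5), Packet(want=0, age=3)
for p, q in divert_switch([young, old], free=[1, 2]):
    print(p.age, q, p.diverted)
```

In the usage example both packets request buffer 0, which has a single free slot, so the older packet is placed correctly and the younger one is diverted to buffer 1.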


Diverting policies 

Once a packet is diverted in the network, it cannot reach 
its desired destination during that pass; it has to go through the net- 
work again. This means that we now have a great deal of freedom 
in deciding where to route these diverted packets. The main goal 
is to route the diverted packet to an interim destination that has a 
"clear" path to the true destination, so that it will not be diverted
again.

Figure 8. Simulation Results for Diverting Switches

We have experimented with several diverting policies. We
present here the results of two of them, to show that diverting pro- 
duces an improvement in the performance and that the specific 
diverting policy has an impact. 

The first diverting policy we call direct diverting. In it the 
routing of the diverted message continues using the destination tag. 
That is, each time the message is diverted, the actual destination is 
wrong in the corresponding bit. As shown in Figure 8 the perfor- 
mance is significantly better than with the blocking policy. 


The second diverting policy we call complement diverting. 
In this case, once a message is diverted, instead of using the desti- 
nation tag for routing, it is routed using a tag corresponding to the 
complement of the source. On its next pass through the network, 
the original destination tag is again used. This policy has the ad- 
vantage that it assures that the rerouted message will avoid the 
NUTS where it was diverted in the first pass. Of course, it can 
pass through some other NUTS. 


Figures 8(a-c) show the corresponding performance for the 
various traffic patterns. We see that this policy produces a some- 
what better performance than the direct policy. 


These simulation results indicate that the use of diverting
switches improves the throughput-delay characteristic of the network
when the traffic produces NUTS. Moreover, the use of diverting 
switches makes the distribution of throughput more uniform, as 
shown in Figure 8(d). 


7. Network with alternate paths 
The MINs previously considered have the characteristic
of a unique path between each source-destination pair. Several
reports [12, 13, 14] have described adding redundant paths to MINs
to improve fault tolerance characteristics. These alternate paths
can also improve the performance of a fully functional network.


In particular, [13] and [14] propose the addition of links to
connect switches in the same stage into rings so that from any
switch in a particular ring packets can reach the same subset of
destinations. The application of this technique to the Omega
network is illustrated in Figure 9(a). If a packet entering a switch
finds the desired output queue full, it can be rerouted to another
switch in the same group via the alternate path link, and still be
able to reach its true destination directly without a second pass
through the network.


This modified Omega network requires augmented
switches acting as a 3x3 crossbar, as shown in Figure 9(b). The
routing control is somewhat more complex than for the original
2x2 switch. We still use a diverting policy, because this has given
better performance for the original network and this policy is also
simpler to control since no "full signals" are needed. Each cycle
up to three packets enter the switch. They are placed in the output
queues giving priority to the older packets that have not been
diverted (in that pass). The highest-priority packet is always placed
in the correct queue, since there is always at least one space in
each queue (because one packet leaves each queue each cycle).
The next packet is placed in the correct queue, if there is space, or
in the alternate queue. Finally, the lowest-priority packet is placed in
the correct queue, in the alternate queue, or in the wrong queue
(diverted).
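
The placement rule for the 3x3 switch can be sketched as follows. This is an illustration under stated assumptions rather than the actual switch control: the queue name ALT, the dictionary-based packet representation, and the function place_three are hypothetical, and the priority order (nondiverted packets first, then older packets) is one reading of the rule above. Each of the three queues is assumed to have at least one free slot at the start of the cycle.

```python
ALT = "alt"  # hypothetical name for the queue feeding the alternate-path link


def place_three(packets, free):
    """Place up to three arriving packets in the queues of a 3x3 switch.

    packets: dicts with keys 'want' (0 or 1), 'age', and 'diverted'.
    free: dict mapping queue name (0, 1, ALT) to free slots, each >= 1.
    Returns (packet, queue) pairs in priority order.
    """
    free = dict(free)
    order = sorted(packets, key=lambda p: (p["diverted"], -p["age"]))
    placed = []
    for p in order:
        # Preference: correct queue, then alternate queue, then wrong queue.
        for q in (p["want"], ALT, 1 - p["want"]):
            if free[q] > 0:
                break
        free[q] -= 1
        if q == 1 - p["want"]:
            p["diverted"] = True  # only the wrong queue forces a second pass
        placed.append((p, q))
    return placed


arrivals = [{"want": 0, "age": a, "diverted": False} for a in (1, 3, 2)]
for p, q in place_three(arrivals, {0: 1, 1: 1, ALT: 1}):
    print(p["age"], q, p["diverted"])
```

With three packets all requesting queue 0 and one free slot per queue, the oldest packet goes to the correct queue, the next uses the alternate path, and only the youngest is diverted.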

Figure 9. An Alternate Path Network and Switch Element


Figure 10 shows the performance of the network with al- 
ternate paths for two of the traffic patterns. We can see that the in- 
troduction of alternate paths produces a significant reduction in the 
degradation due to NUTS. 

Figure 10. Throughput-Delay Graph (alternate paths)


8. Conclusions 


We have shown several traffic patterns that produce NUTS 
in multistage interconnection networks and therefore result in a de- 
gradation of performance. The randomization technique, proposed 
for eliminating imbalances of loading in multicomputers, is not ap- 
propriate in this case because it increases the delay of each packet 
and the real traffic through the network. The use of discarding 
switches is not advantageous either because the discarded traffic 
has to be resent through the same congested path. 


As positive solutions, we have shown that diverting
switches produce a significant reduction in the degradation.
Moreover, the control of congestion is simpler than that for blocking
switches because no "full signals" are needed. However, to
implement this policy, it is necessary to have a network with
wrap-around connections.


The performance is much better using networks with alternate
paths. However, these networks require 3x3 switches instead of
the basic 2x2, which complicates the implementation.


The use of these modifications to the basic blocking policy 
in the control of contention in multistage interconnection networks 
makes it possible to use the network effectively for a larger variety 
of traffic patterns. 

Abstract

A self-routing algorithm for passing the Linear class of permutations
in Benes, 7 and (2n - 1)-stage shuffle-exchange
networks of N = 2^n inputs/outputs is presented. In these
networks, switches in the first (n - 1) stages are set by
comparing the destination tags of the inputs to the switch;
switches in the remaining stages are set by the self-routing
Ω algorithm. Thus, the total time required for routing
any Linear permutation is O(n), the same as the network delay
time. The algorithm also routes Ω^{-1} permutations in
Benes and Ω permutations in the 7 network trivially. The class
of permutations that are routable by the algorithm is much
richer than the class of Linear permutations. This algorithm
routes all possible permutations for the 4 input/output
Benes network B(2) (same as the 3-stage shuffle-exchange network)
and the 7 network, since all of these permutations are in the
Linear Class.


1 Introduction 


Typically, a parallel computer consists of a number of pro- 
cessors and an interconnection network for exchange of 
information between them as well as with memory mod- 
ules. Considering a processor/memory network model, any 
processor should be able to communicate with any mem- 
ory module which is called full access. To support SIMD 
type computations, ideally we would like the network to be 
able to perform all the permutations that allow simulta- 
neous use of the memory modules. Such capabilities exist 
in crossbar networks and networks that are rearrangeable, 
for example the Benes network.

We view parallel computing as computation steps, during
which some or all of the processors are busy computing,
and communication steps, at which some permutation
function is set up by the network to allow data exchanges.
If the underlying network cannot support a required
permutation function then it has to be realized in
multiple steps. The advantage of a rearrangeable network
is that any permutation can be realized in one communication
step. Further, if they are built using smaller
switches such as 2 x 2, then they are relatively cheaper
than crossbar networks. Therefore rearrangeable networks
are used in some parallel computer implementations (e.g.
Ghai 1).


A well known rearrangeable network is the Benes network [2],
which is built in a recursive manner using 2 x 2
switches, and is shown in figure 1. In such networks, it
takes some time to set up the switches to realize a given
arbitrary permutation. For a Benes network with N = 2^n inputs
and outputs, determining the switch settings to realize
an arbitrary permutation takes O(N log N) time on
a uniprocessor computer [7]. If the required permutations
change frequently while computing a problem, the communication
time may become a bottleneck. An approach to
solve this problem is to compute the switch settings for a
given permutation using a parallel computer with N PE's.
A separate network with static links between the PE's in
the parallel computer under consideration could be used
for this computation as suggested by Nassimi and Sahni [6].
Alternatively, the Benes network itself can be set to realize
the perfect shuffle permutation easily, to convert the parallel
computer under consideration to a perfect shuffle computer
and determine the switch settings in O(n^4) time using the
algorithm proposed by Nassimi and Sahni [5]. However, it
still takes a considerable amount of time to realize a permutation
compared to the propagation delay O(n).

*This research is supported by the NSF Presidential Young Investigator
Award No. MIP 8452003, DARPA/ARO Contract No. DAAG
29-84-k-0066, ONR Contract No. N00014-86-k-0602.

We are interested in developing fast self-routing algorithms
for many useful permutations required in parallel
processing, if not for all the N! permutations. Due
to the nature of techniques used in developing parallel
algorithms, the permutations required are generally nice
and regular and can be expressed as algebraic functions.
Some work was done on developing self-routing algorithms
for classes of permutations, in particular Bit-Permute-Complement
(BPC), by Nassimi and Sahni [6]. They also
prove that their algorithm routes Lenfant's FUB
families [3].

In this paper we develop self-routing algorithms for the
Linear Class (L) of permutations. The algorithm is very
simple and routes many other classes of permutations as
well. We consider the Benes network as well as the 7 network
of Yew and Lawrie [8] and the (2n - 1)-stage shuffle-exchange
network. The results include simple routing algorithms
for the classes L (we extend this class with complements
of bits), Ω, and Ω^{-1} on all these networks. For other
permutations one can use a general looping type algorithm or
break it into multiple simpler permutations.


2 Routing in Benes Network 

We will use I to represent any source and O to represent 
its destination tag. All binary additions in this paper are 
modulo 2. 

Figure 1: 2^n input/output Benes network B(n).


Definition 1 A permutation is said to be a linear
permutation [4] if there exists a nonsingular binary matrix
Q of size n x n such that every input I (whose binary
representation is (I_n I_{n-1} ... I_1)) and its output O (whose
binary representation is (O_n O_{n-1} ... O_1)) satisfy equation 1.

    O^T = Q x I^T    (1)

Definition 2 Let I' = (I_n I_{n-1} ... I_1, 1). A permutation
is a Linear-Complement (LC) permutation if there exists
a binary matrix P of size n x (n+1), where the submatrix of P formed
by taking the first n columns is nonsingular, such that every
(I, O) pair satisfies equation 2.

    O^T = P x (I')^T    (2)

With the definitions given above, LC contains BPC.
Throughout this paper we will assume that the number
of inputs/outputs to the interconnection network is
N = 2^n. We will denote linear-complement, omega and
inverse omega permutations on N inputs in compact form
as LC(n), Ω(n) and Ω^{-1}(n) respectively, and B(n) denotes
the Benes network with N inputs/outputs.
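
To make the two definitions concrete, the following sketch evaluates an LC permutation as a matrix-vector product modulo 2. It is only an illustration: the helper names and the two example matrices are hypothetical, rows of P are ordered O_n down to O_1, and columns are ordered I_n down to I_1 followed by the constant 1, with I_n taken as the most significant bit.

```python
def to_bits(x, n):
    """Binary representation (x_n, ..., x_1), most significant bit first."""
    return [(x >> (n - 1 - k)) & 1 for k in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    v = 0
    for b in bits:
        v = (v << 1) | b
    return v


def apply_lc(P, i, n):
    """Output O for input i, computed modulo 2 from the n x (n+1) matrix P."""
    vec = to_bits(i, n) + [1]  # the extended input vector (I_n ... I_1, 1)
    return from_bits([sum(p * v for p, v in zip(row, vec)) % 2 for row in P])


n = 3
# Bit reversal: O_3 = I_1, O_2 = I_2, O_1 = I_3 (no complemented bits).
reversal = [[0, 0, 1, 0],
            [0, 1, 0, 0],
            [1, 0, 0, 0]]
# Identity with O_1 complemented: O_1 = I_1 + 1.
flip_low = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 1]]
print([apply_lc(reversal, i, n) for i in range(8)])
print([apply_lc(flip_low, i, n) for i in range(8)])
```

A matrix whose last column is zero describes a linear permutation in the sense of Definition 1, and a nonzero last column complements the corresponding output bits. In both examples the first n columns form a permutation matrix, which is the BPC special case.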


2.1 Routing Algorithm 

Let the output lines of a switch be numbered as ‘0’ and 
‘1’ for upper and lower outputs respectively. Each input 
line to a switch will have a routing bit. An input line 
to a switch is connected to the output line of the switch 
indicated by its routing bit. If the bit is ‘1’ then that input 
is connected to the lower output of the switch otherwise, it 
is connected to the upper output of the switch. Routing of
LC permutations in the Benes network is given by the following
algorithm.


Algorithm 1 For the first (n - 1) stages, an input line to
a switch in stage i, 1 ≤ i ≤ (n - 1), will have the i-th bit of its
destination tag as its routing bit. For the next n stages, an
input line to a switch in stage j, n ≤ j ≤ (2n - 1), will have the
(2n - j)-th bit of its destination tag as its routing bit. For
the first (n - 1) stages, switches are set up such that the input
line with the smaller destination tag value is routed according
to its routing bit. For the next n stages, switches are set
up such that both the inputs are routed according to their
routing bits.
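
The switch-setting rule of Algorithm 1 for a single switch can be sketched as follows. The function names are hypothetical, and the sketch assumes that bit 1 of a destination tag is its least significant bit, which agrees with the binary representation (O_n ... O_1) used above. It sets one switch only; it does not model the interstage wiring of B(n).

```python
def routing_bit(tag, stage, n):
    """Routing bit of a destination tag at a 1-indexed stage of B(n).

    Assumption: bit 1 is the least significant bit of the tag.
    """
    k = stage if stage <= n - 1 else 2 * n - stage
    return (tag >> (k - 1)) & 1


def set_switch(tag_top, tag_bottom, stage, n):
    """Return (output of top input, output of bottom input); 0 = upper, 1 = lower."""
    rt = routing_bit(tag_top, stage, n)
    rb = routing_bit(tag_bottom, stage, n)
    if stage <= n - 1:
        # First n-1 stages: the smaller tag is routed by its routing bit,
        # and the other input takes the remaining output.
        if tag_top < tag_bottom:
            return rt, 1 - rt
        return 1 - rb, rb
    # Last n stages: both inputs simply follow their routing bits.
    return rt, rb


# Both routing bits are 1 at stage 1; the lower input has the smaller tag.
print(set_switch(0b101, 0b011, stage=1, n=3))
```

In the usage example the conflict is resolved in favour of the lower input, which takes the lower output while the upper input is sent to the remaining upper output. When the two routing bits differ, the rule reduces to routing both inputs as requested.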

In the first (n - 1) stages, conflicts are resolved by giving priority
to one of the input lines. This algorithm is different
from that of Nassimi and Sahni [6]: in case of conflict
in setting up a switch, their algorithm gives priority to
the top input line, whereas our algorithm gives priority to
the input line with the smaller destination tag value. Consider
figure 2(a) with destination tags for its inputs as shown.
Let the bit indicated by the arrow be the routing bit. In
this case, the routing bit for both the inputs is '1' so there is
a conflict. This is resolved by comparing the destination
tags and giving priority to the input with the smaller destination
tag value, which in this case is the lower input. The
other input line is automatically routed to the remaining
output line. In figure 2(b) the routing bits for the two inputs are
different, so they get what they want and the switch is set
as shown.

Figure 2: An example showing switch settings done by the
algorithm.

Figure 3: Routing an LC permutation in Benes using the
algorithm proposed.


A complete example of this routing scheme is given in figure 3.
Destination tags for each input line to a switch are
given in binary form. The routing bit for each stage is
indicated by an arrow. This permutation is not routable
by Nassimi and Sahni's algorithm (see figure 4). The LC
permutation given in figures 3 and 4 has the functional
form given below.
O;=h; O22 =; O = bth 

In the first stage (figure 3), the routing bit is the same for both
inputs to a switch. Hence switches in the first stage are set
up such that the input with the smaller destination tag is routed
correctly, which in this case is the top input line. After the


first stage of routing, there exists LC permutation between 
O3,O2 of destination fe and aa of input _ for both 
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Figure 4: Routing an LC in Benes using Nassimi and 
Sahni’s algorithm fails. Incorrectly routed inputs are indi- 


cated by an asterisk. 


top 4 x 4 Benes network B(2) given as, 
OO; =1h+h; O2=f 

and bottom B(2) given as, 

Og =1l4+ht+h; O2=Ir 
There are conflicts in setting up the switches in the second
stage of the network as well. For the topmost and bottommost
switches in the second stage, the top input line has the
smaller destination tag value, so these switches are set to
route the top input line correctly. For the other two switches
the bottom input lines have the smaller destination tag value;
hence those switches are set to route the bottom input lines
correctly. Conflicts exist only in the first two stages of the
network. The last 3 stages are routed without any conflicts, as
given by the algorithm.


2.2 Proof of Correctness 

Theorem 1 Any LC(n) permutation is routable by routing
algorithm 1 in B(n).

Proof: We use the fact that stages 2,...,2n - 2 of
B(n) are just two B(n - 1) networks to prove the theorem
by induction. To do this we need to show that after the first
stage of routing, the resulting permutation between the most
significant (n - 1) bits of the destination tag and the input
of B(n - 1) is still an LC(n - 1) permutation.


More formally, this is true for n = 1. Let it be true for all 
m <n. Now consider the following lemma. 


Lemma 1 After one stage of routing of an LC(n) permutation
using algorithm 1, for any input-output pair I and O, the
permutation between $(O_n,\ldots,O_2)$ and
$(I_{n-1},\ldots,I_1)$ for the top and bottom Benes networks
with $2^{n-1}$ inputs/outputs belongs to LC(n - 1).

Proof for the lemma: The inputs to a switch differ only in bit
$I_1$. So depending on whether the equation for routing bit
$O_1$ contains $I_1$ or not, the routing bits of the inputs to a
switch are different or the same. Consider the first case;
the equation for $O_1$ will be of the form $O_1 = I_1 + LF_1$,
where $LF_1$ is independent of $I_1$. Since each input is routed
according to its routing bit because there are no conflicts,
the equation for $O_1$ after the exchange is given as $O_1 = I_1$.
So the effect of the exchange is like substituting $I_1 + LF_1$
for all occurrences of $I_1$ in the equations for
$O_n,\ldots,O_1$. Since an inverse shuffle is performed after the
exchange, all the top outputs of the switches go to the top Benes
network B(n - 1) and all bottom outputs of the switches go to the
bottom Benes network B(n - 1). So substituting $I_1 = 0\,(1)$ in
the equations for bits $O_n,\ldots,O_2$ of the routing tags of the
inputs routed to the top (bottom) B(n - 1) we get an LC(n - 1)
permutation as desired. In the second case, the equation for $O_1$
will be of the form $O_1 = LF_1$, where $LF_1$ is independent of
$I_1$. Let k be the most significant bit in which the two
destinations differ. Then the equation for $O_k$ contains $I_1$
and is given as $O_k = I_1 + LF_k$, where $LF_k$ is independent
of $I_1$. The algorithm routes the inputs such that the input with
$O_1 = O_k$ is routed to the top output line of the switch and the
other input to the bottom output line of the switch. So after the
exchange operation $O_1 = I_1 + O_k$. The net effect is equivalent
to substituting $I_1 + LF_1 + LF_k$ for all occurrences of $I_1$
in the equations for $O_n,\ldots,O_1$. Since an inverse shuffle is
performed after the exchange, as in the previous case we get an
LC(n - 1) permutation between bits $O_n,\ldots,O_2$ of the routing
tags and inputs $I_{n-1},\ldots,I_1$ of the top and bottom
B(n - 1) networks. □
From the above lemma, LC(n) is routed in the first stage
of B(n) such that there exists an LC(n - 1) permutation between
$O_n,\ldots,O_2$ and the inputs of B(n - 1). Since this
is correctly routed by the induction hypothesis, after (2n - 2)
stages all the outputs are in the correct place as far as the
first n - 1 bits are concerned. This means that two destinations
which differ only in the last bit of their destination tag do not
exist in the same B(n - 1). A shuffle and exchange will
route these inputs to the correct places. □


3 Routing in Shuffle Exchange Networks


We will modify the routing algorithm to route LC permutations
in a π-network. A π-network is a cascade of two Ω
networks [8].


3.1 Routing Algorithm 


Algorithm 2 For the first n stages of the π-network, an
input to a switch in stage i, 1 ≤ i ≤ n, will have the
(n - i + 1)-th bit of its destination tag as the routing bit.
Routing is done as follows. First the destination tags are bit
reversed and then compared. The smaller one will be routed
according to its routing bit as before. For the next n stages of
the network we use the standard Ω self-routing algorithm. □


A complete example is given in figure 5. The routing bit in
each stage is indicated by an arrow. This permutation is
not routable by the self-routing algorithm given in [8].
Consider the second switch from the top in stage 1 of figure 5.
Both inputs have the same routing bit '1'. But the upper input
has the destination tag with the smaller value when compared to
that of the lower one after bit reversal of the destination tags.
Hence the upper input is routed to the lower output of the
switch. In the case of the bottommost switch in the first
stage, the lower input has the smaller destination tag value
after bit reversal. So that switch is set such that the lower
input is routed according to its routing bit, that is, to the
lower output of the switch.
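As a rough illustration of the switch rule in the first n stages of Algorithm 2, the sketch below bit-reverses the two destination tags before comparing them. The helper names `bit_reverse` and `set_switch_pi` are hypothetical, bit $O_1$ is assumed to be the least significant bit of the integer tag, and routing bit 0 is assumed to select the upper output.

```python
def bit_reverse(tag, n):
    """Reverse the n-bit binary representation of tag."""
    out = 0
    for _ in range(n):
        out = (out << 1) | (tag & 1)
        tag >>= 1
    return out


def set_switch_pi(top_tag, bottom_tag, stage, n):
    """Return (tag on upper output, tag on lower output) for a switch
    in stage `stage` (1-based, 1 <= stage <= n) of the first half.

    The routing bit is the (n - stage + 1)-th bit of the tag, which is
    zero-based position n - stage when O_1 is the least significant bit.
    """
    bit = n - stage
    top_bit = (top_tag >> bit) & 1
    bottom_bit = (bottom_tag >> bit) & 1

    top_has_priority = (
        top_bit != bottom_bit  # no conflict: both inputs are satisfied
        or bit_reverse(top_tag, n) < bit_reverse(bottom_tag, n)
    )
    if top_has_priority:
        cross = top_bit == 1
    else:
        cross = bottom_bit == 0

    return (bottom_tag, top_tag) if cross else (top_tag, bottom_tag)
```

For example, with n = 3 and stage 1, tags 110 (upper) and 101 (lower) conflict on the routing bit; after bit reversal they become 011 and 101, so the upper input wins and is sent to the lower output, even though 110 is the larger tag before reversal.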


3.2 Proof of Correctness 


We need the following lemmas to prove that the algorithm
works correctly.


Figure 5: Routing an LC permutation in a π-network using
the proposed algorithm.


Lemma 2 If a permutation is an LC permutation then after
a shuffle on the input bits, the resulting permutation is still
LC.

The proof is obvious, hence omitted. □


Lemma 3 If a permutation is an LC permutation then after
performing an exchange operation on the inputs using
algorithm 2 the resulting permutation is still LC.


The proof of this lemma follows very closely that of lemma
1. The crucial part of the proof is showing that for the first
n stages the algorithm performs the exchange operation such
that routing bit $O_i$ is set to $I_{n-i+1}$ if the equation for
$O_i$ contains $I_{n-i+1}$, otherwise to $I_{n-i+1} + O_j$ for
some $j < i$ as specified by the algorithm. □

In the case of the Benes network we noted that the two input
lines to a switch in stage 1 differ only in $I_1$. However, in
the case of the π-network we must take into account the fact
that a shuffle is performed before the exchange operation,
hence $I_n$ becomes $I_1$. So the two input lines to a switch
in the first stage of a π-network differ only in $I_n$. In the
first stage, algorithm 2 will route using $O_n$ as the routing
bit. So there will not be any conflicts if $O_n$ contains $I_n$.
Proceeding in this manner it is easy to see that for the first
n stages there will not be any conflicts if the routing bit
$O_{n-i+1}$ contains $I_{n-i+1}$.


Lemma 4 Routing an LC permutation using algorithm 2
will always assure that at any stage i, 1 ≤ i ≤ n, the
destination tags for the inputs of a switch will differ in at
least one of the bits $O_{n-i+1}$ through $O_1$.

Proof: This is true for stage 1 since LC is a bijection. After
(m - 1) stages of shuffle-exchange, I will be of the form
$(I_{n-m+1},\ldots,I_1,\tilde{I}_n,\ldots,\tilde{I}_{n-m+2})$,
where $\tilde{I}_j$ means either $I_j$ or $\bar{I}_j$, the
complement of $I_j$. After a shuffle, I will be of the form
$(I_{n-m},\ldots,I_1,\tilde{I}_n,\ldots,\tilde{I}_{n-m+2},I_{n-m+1})$.
So the destination tags for the two inputs of a switch will not
differ in any of the bits $O_{n-m+1},\ldots,O_1$ iff none of the
equations for $O_{n-m+1},\ldots,O_1$ contain $I_{n-m+1}$. From
lemma 3 we know that after an exchange operation at stage i,
$O_{n-i+1}$ is set to $I_{n-i+1}$ or to $I_{n-i+1} + O_j$,
$j < i$, whereupon $O_j$ is set to $I_{n-i+1} + O_{n-i+1}$. So the
equations for $O_n,\ldots,O_{n-m+2}$ are either independent of
$I_{n-m+1}$ or, if they contain the term $I_{n-m+1}$, then the
equation for some $O_j$, $1 \le j < (n-m+2)$, will also contain
that term. □


Theorem 2 LC permutations are routable using algorithm 2
in a π-network.


Proof: From lemmas 2 and 3 it follows that after routing
one stage of shuffle and exchange the resulting permutation
is still an LC permutation, but it could be different
from the earlier one. To distinguish these, we use a
superscript on the matrix $P_{n \times (n+1)}$. So the P matrix
in the LC permutation for stage i is indicated as $P^i$. The P
matrix for the first stage, denoted $P^1$, is the same as the P
matrix in the original LC permutation.
Consider an input $I = (I_n,\ldots,I_1)$ with destination
$O = (O_n,\ldots,O_1)$. Let I be of the form
$D = (D_n,\ldots,D_1)$ after routing the first n stages using the
algorithm. From lemma 3 and the discussion following the lemma,
at any stage i, 1 ≤ i ≤ n, $D_{n-i+1}$ is the same as
$O_{n-i+1}$ if the destinations have different (n - i + 1)-th
bits, which is true only when the corresponding entry of $P^i$ is
nonzero. Otherwise, $D_{n-i+1} = O_{n-i+1} + O_j$,
$j < (n-i+1)$.
Therefore we have,


Di Oa AP OR Rae (3) 
Thus the equations for $D_n,\ldots,D_1$ will be of the form
given below.
Dr, = On+P},0;, for somej <n 
Dn-1 = On1+ P2,0;, for somej <n—1 
dD, = OF 


We can rewrite these equations such that they are in the Ω
characteristic equation form. For example we can rewrite
the equation for $D_n$ as given below.

On = Dnt Pi 03, 
for some $j < n$. But $O_j$ can again be rewritten in the form
given above. In general we can substitute $D_k$ plus the
corresponding $P$-weighted term in $O_j$, for some $j < k$,
for $O_k$. Proceeding in this manner we obtain the equation
for $O_n$ only in terms of D's. In the same manner we can
rearrange the equations to get equations for all O's as given
below.

$O_i = D_i + F_i(D_{i-1},\ldots,D_1), \quad 1 \le i \le n$  (4)


Clearly these equations are in Ω form, hence routable by
the last n stages of the π-network. □

As an example consider the LC permutation in figure 5,
characterized by the following set of equations.
O3=h; Oc=kh; Q.=bt+h 
Here Og does not contain Iz but O; does. Hence substitut- 
ing [3+ ,(= DF3)+1,(= LF) in all the occurrences of J; 
and performing a shuffle on input bits we get the following 
set of equations which characterize LC permutation for the 
second stage. 
Os=h; Or=Istht+h; Aa=h+h 
One can verify that these equations hold after the first
stage of switches. $D_3$ is given by the following equation.

$D_3 = O_3 + O_1$

In a similar manner we can obtain equations for $D_2$ and
$D_1$, as given below.

$D_2 = O_2; \quad D_1 = O_1$

Rewriting these equations we get,

$O_3 = D_3 + D_1; \quad O_2 = D_2; \quad O_1 = D_1$

which are in characteristic Ω form, hence routable in the
last three stages of the network (same as an 8 input/output
Ω network). □
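The rewriting step in this example, from $D_3 = O_3 + O_1$, $D_2 = O_2$, $D_1 = O_1$ to $O_3 = D_3 + D_1$, $O_2 = D_2$, $O_1 = D_1$, can be checked exhaustively over GF(2), where addition is exclusive OR. The short sketch below does only that; the function names are introduced here for illustration.

```python
from itertools import product


def d_from_o(o3, o2, o1):
    """D bits in terms of O bits: D3 = O3 + O1, D2 = O2, D1 = O1 (mod 2)."""
    return (o3 ^ o1, o2, o1)


def o_from_d(d3, d2, d1):
    """Rewritten form: O3 = D3 + D1, O2 = D2, O1 = D1 (mod 2)."""
    return (d3 ^ d1, d2, d1)


def rewriting_is_consistent():
    """True if the two equation sets are inverses on all 3-bit tags."""
    return all(
        o_from_d(*d_from_o(*o)) == o
        for o in product((0, 1), repeat=3)
    )
```

In the rewritten form each $O_i$ equals $D_i$ plus a function of lower-indexed D bits only, which is the shape of equation (4).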


3.3 (2n - 1)-stage Shuffle Exchange Network

In the proof given above, we showed that $D_1 = O_1$. This
implies that all the switches in the last stage are set
straight. Hence, we can eliminate all the switches in the
last stage. So, we need only (2n - 1) stages of shuffle
exchange and a perfect shuffle. However, we can eliminate
this shuffle at the output as follows.

We change the algorithm to treat destination tags as if
a shuffle was performed on them, i.e., $O_i$ is treated as
$O_{(i+1) \bmod n}$. Let the given permutation be denoted as
$\Pi$. With the modification the algorithm behaves as if a
shuffle was performed on $\Pi$. Hence in effect it routes
$\Pi' = (\sigma\Pi)$. After routing for (2n - 1) stages a
$\sigma$ is required to route $\Pi'$ correctly. So, after
(2n - 1) stages we have routed $(\sigma^{-1}\Pi')$ correctly.
But $\sigma^{-1}\Pi' = \sigma^{-1}\sigma\Pi = \Pi$. Hence,


Theorem 3 LC is routable in a (2n - 1)-stage shuffle
exchange network using the modified algorithm described above.


4 Conclusions 


In this paper we have presented algorithms to route LC
permutations on Benes, π and (2n - 1)-stage shuffle
exchange networks. Since there will not be any conflicts in
the first n stages of the Benes network if the permutation
is in $\Omega^{-1}$, this algorithm routes $\Omega^{-1}$
permutations as well. In fact the class of permutations routable
using the algorithm given in this paper is much larger than the
LC class. By a similar argument, any Ω permutation is routable
using the algorithm in a π-network. It is interesting to note
that it routes all permutations in B(2), the 4 input/output
Benes network. However, this algorithm does not route
all Ω permutations in Benes networks with N > 4. If the
permutation is known to be Ω then it can be routed by setting
the first (n - 1) stages of the Benes network straight,
as suggested by Nassimi and Sahni [6].

Abstract


This paper introduces a class of multiple path multistage
interconnection networks termed Reduced Size Interconnection
(RSI). An RSI-m network of size N is designed to
have m unique path multistage networks of size N/m. This
design approach makes it possible to construct a
fault-tolerant network at the same cost as a unique path
multistage network of the same size. We have considered the
cost-effectiveness of RSI networks for reliability and for
performance. RSI networks are shown to compare favorably
with some well-known multistage networks.


1. Introduction 

Multistage interconnection networks such as Omega
networks [6] and Delta networks [10] have been favored for
processor-memory connection in "large-scale" shared memory
machines because of their cost-effectiveness. However, as is
well known, the multistage network lacks fault-tolerant
capability because of its basic property that there is only a
single path between any source-destination pair (unique path
network). This lack of fault-tolerant capability has received
considerable attention, and many ways of providing
fault-tolerance to the network have been proposed.

The basic idea of a fault-tolerant network is to provide
multiple paths for a source-destination pair so that alternate
paths can be used in case of faults in a path. Providing
multiple paths can be done in various ways. The methods
include increasing the number of stages [2], using multiple
links between switches [3, 7, 9], increasing the size of
switches [8], partitioning a unique path network into several
subnetworks [5, 12], and incorporating multiple copies of a
unique path multistage network [4, 11]. Compared to the
unique path networks, these multiple path networks certainly
have higher reliability, but with increased hardware
complexity, which not only increases cost but also weakens
the claim of enhanced reliability.


In this paper, a class of multiple path multistage inter- 
connection networks, dubbed as “Reduced Size Interconnec- 
tion”, is proposed to provide a cost-effective fault-tolerant 
network for large-scale shared memory parallel computers. 
The network is designed to provide fault-tolerant capability 
without increasing hardware complexity and cost over a 
unique path multistage network. 


2. Reduced Size Interconnection Network 


Suppose, for a network of size N, i.e., with N sources
and N destinations, m disjoint partitions of N/m sources
and N/m destinations are formed first, where m (≥ 2) and
N (> m) are powers of 2.¹ In a Reduced Size Interconnection
(RSI) network of size N, m unique path multistage
networks of size N/m are provided, i.e., one for each
partition, and each source and destination is linked to all the
m unique path networks via m×1 multiplexers and 1×m
demultiplexers, respectively (the rules for the connection
come shortly). Thus, an RSI-m network of size N has one
m×1 multiplexer stage (input stage), $\log_2(N/m)$ stages
(intermediate stages) of 2×2 crossbar switches, and one 1×m
demultiplexer stage (output stage). There are N/2 switches
in each of the intermediate stages, N multiplexers in the
input stage, and N demultiplexers in the output stage. Each
unique path network of size N/m plus its associated
multiplexers and demultiplexers will be called a subnetwork.
These subnetworks will be denoted by
$G^0, G^1, \ldots, G^{m-1}$.
Although it is possible to have different types of unique path
networks, we assume that all the m unique path networks
are of identical type. The type of unique path network
taken is called the base network.


Let $S_{i,j}$ and $D_{i,j}$ denote the source j and the
destination j, respectively, which are associated with a
subnetwork $G^i$ based on the partition ($0 \le i \le m-1$ and
$0 \le j \le N/m-1$). Also, let $MUX_{l,k}$ and $DEMUX_{l,k}$
represent the multiplexer k and demultiplexer k in $G^l$,
respectively, where $0 \le l \le m-1$ and $0 \le k \le N/m-1$.
Then, the sources and destinations are connected to each
subnetwork as follows:

i) Each $S_{i,j}$ is connected to every i-th input port of the m
multiplexers from $MUX_{0,j}$ to $MUX_{m-1,j}$.

ii) Every i-th output port of the m demultiplexers from
$DEMUX_{0,j}$ to $DEMUX_{m-1,j}$ is connected to each $D_{i,j}$.

An example of an RSI-2 network of size 8, in which the Omega
network is used as the base network, is illustrated in Figure 1.
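The input-stage connection rule can be made concrete with a small enumeration. The sketch below is only an illustration under the reading that source $S_{i,j}$ feeds input port i of multiplexer j in every subnetwork; the function name `mux_wiring` and the dictionary layout are choices made here, not part of the original description.

```python
def mux_wiring(n_size, m):
    """Map each multiplexer (l, k) to the sources on its m input ports.

    n_size: network size N; m: number of subnetworks (both powers of 2).
    The list stored for multiplexer (l, k) is ordered by input port, so
    entry i is the pair (i, k), standing for source S_{i,k}.
    """
    per_subnetwork = n_size // m
    wiring = {}
    for l in range(m):                   # subnetwork G^l
        for k in range(per_subnetwork):  # multiplexer k inside G^l
            wiring[(l, k)] = [(i, k) for i in range(m)]
    return wiring
```

For N = 8 and m = 2 this yields 8 multiplexers with 2 input ports each, and every source appears on exactly m multiplexers, one per subnetwork. The output stage is the mirror image, with demultiplexer output ports in place of multiplexer input ports.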


Figure 1. An RSI-2 network of size 8 with 2×2 switches.

¹Each partition may have a different size. Also, the number of partitions may
vary. However, in this paper, we consider only equal partitioning and
"small" values of m (2 or 4). Also, for the sake of simplicity, our discussion is
restricted to "rectangular" networks, i.e., those having the same number of
inputs and outputs, of 2×2 switches.

Routing in an RSI network can be divided into three
steps, assuming that the selection of a path out of the m
disjoint paths is already done at a source. At the input stage,
i.e., the stage of multiplexers, a request with a routing tag
just passes through a multiplexer in the subnetwork chosen by
the source. None of the tag is consumed because a
multiplexer always has only one output port. After that, the
routing of the request in the intermediate stages is the
routing in the base network of size N/m, using
$\log_2(N/m)$ bits. Finally, the request arriving at the output
stage, i.e., the stage of demultiplexers, is routed to the
proper destination by using $\log_2 m$ bits, the remaining part
of the tag. Notice that selecting any particular subnetwork does
not change the routing in an RSI network: the routing algorithm
is always the same regardless of the subnetwork selected by a
source.
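The split of the routing tag into a base-network part and a demultiplexer part can be sketched as below. The text does not say how a destination number maps to the pair (i, j) of $D_{i,j}$; the sketch assumes the numbering d = i·(N/m) + j suggested by the labels in Figure 1, and the function name `split_routing_tag` is hypothetical.

```python
def split_routing_tag(dest, n_size, m):
    """Return (base_bits, demux_bits) as binary strings for destination dest.

    Assumption (not stated in the text): destination number
    dest = i * (N/m) + j corresponds to D_{i,j}.
    base_bits:  log2(N/m) bits used in the intermediate stages (index j).
    demux_bits: log2(m) bits used at the demultiplexer stage (index i).
    """
    per_subnetwork = n_size // m
    i, j = divmod(dest, per_subnetwork)
    n_base = per_subnetwork.bit_length() - 1  # log2(N/m) for powers of 2
    n_demux = m.bit_length() - 1              # log2(m) for powers of 2
    return format(j, f"0{n_base}b"), format(i, f"0{n_demux}b")
```

Nothing in the tag depends on which subnetwork the source picked, which matches the remark that the routing algorithm is the same regardless of the subnetwork selected.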


3. Reliability of RSI Network 


We consider the reliability of RSI networks under the
"full access" criterion, and measure their reliability in terms
of "Mean Time to Failure" (MTTF). We assume that any of
the switching components (crossbar switches, multiplexers,
and demultiplexers) in an RSI network can fail. Based on
the recent survey by Adams, Agrawal, and Siegel [1], the
fault model and the fault-tolerance criterion applied in our
analysis are common and strict.


Before turning to the reliability analysis in
terms of MTTF, the number of faults that RSI networks can
tolerate is worth mentioning. Since an RSI-m network
provides m disjoint paths for each source-destination pair
through the m independent subnetworks, it is (m-1)-fault
tolerant and is robust in the presence of more than m-1
faults, up to
$\frac{m-1}{m}\left(\frac{N}{2}\log_2\frac{N}{m}+2N\right)$
faults in the network.

To make the analysis of MTTF tractable, we use
assumptions similar to the ones that have been made
previously in other studies of fault-tolerant networks. Each
component has an independent Poisson distribution of failures
with a constant failure rate. The failure rate is assumed to
be proportional to the gate complexity of a switching
component. The complexity of a component is considered in
terms of "crosspoints." Thus, we assume that the failure rate
for an m×1 multiplexer or a 1×m demultiplexer is
$(m-1)\lambda/4$ if the failure rate for a 2×2 crossbar switch
is $\lambda$, because an m×1 multiplexer or a 1×m demultiplexer
has m-1 crosspoints while a 2×2 crossbar switch has 4
crosspoints.

By considering each stage of the network separately, we 
have an optimistic probability that a RSI-m network is not 
faulty for a time period (0, ¢t) : 


Rrst-m(t )=(1-(1-e™! y" 1% iisett /4)m ae 


where $M_s = \frac{N}{2m}\log_2\frac{N}{m}$ and $M_m = 2N/m$.
So, we have the upper bound

$\mathrm{MTTF}_{\mathrm{RSI}\text{-}m} = \int_0^\infty R_{\mathrm{RSI}\text{-}m}(t)\,dt.$

Since a sufficient condition for the network to be operative is
that at least one of the subnetworks is fault-free, the lower
bound is

$\mathrm{MTTF}_{\mathrm{RSI}\text{-}m} = \int_0^\infty \left(1-\left(1-e^{-\left(M_s+M_m\frac{m-1}{4}\right)\lambda t}\right)^m\right)dt$

where $M_s = \frac{N}{2m}\log_2\frac{N}{m}$ and $M_m = 2N/m$.
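The lower-bound integral can be evaluated in closed form. Its integrand, $1-(1-e^{-at})^m$ with $a = (M_s + M_m(m-1)/4)\lambda$, is the survival function of the maximum of m independent exponential lifetimes with rate a, whose mean is $H_m/a$ with $H_m = 1 + 1/2 + \cdots + 1/m$. The sketch below uses this standard identity; it is an illustration rather than the computation used for the figures, and the function name and the default $\lambda = 1$ are assumptions made here.

```python
def rsi_mttf_lower_bound(n_size, m, lam=1.0):
    """Lower bound on the MTTF of an RSI-m network of size N.

    Each subnetwork has M_s = (N / 2m) * log2(N/m) switches of rate lam and
    M_m = 2N/m multiplexers/demultiplexers of rate (m - 1) * lam / 4.
    The network is counted as working while at least one subnetwork is
    fault-free, so the bound is the mean of the maximum of m independent
    exponential lifetimes: H_m / subnetwork_rate.
    """
    stages = (n_size // m).bit_length() - 1  # log2(N/m) for powers of 2
    m_s = (n_size // (2 * m)) * stages
    m_m = 2 * n_size // m
    subnetwork_rate = (m_s + m_m * (m - 1) / 4) * lam
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    return harmonic / subnetwork_rate
```

For N = 16 and m = 2 each subnetwork has 12 switches and 16 multiplexers/demultiplexers, giving a subnetwork failure rate of $16\lambda$ and a lower bound of $1.5/(16\lambda)$.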


For comparison purposes, we obtained the MTTFs of the
Omega network and some other fault-tolerant networks (ESC
[2], 3-replicated [4], and INDRA with R = 2 [11]) in a similar
way. Although the Omega network is not a fault-tolerant
network, we use its reliability as a yardstick to measure the
improved reliability of RSI networks. The ratios of the
bounds on MTTF of the fault-tolerant networks to that of the
Omega network are shown in Figure 2, in which the network
size N varies from 16 to 1024; one can easily see that RSI
networks significantly improve reliability
compared to the Omega network.


4. Performance of RSI Networks


The important performance measure for unbuffered
interconnection networks is "bandwidth (BW)" or "probability
of acceptance (PA)". The performance analysis of
unbuffered RSI networks is based on the assumptions made
for the usual "uniform traffic model" [4, 5, 10, 11]. Concerning
the selection of a particular path out of the m disjoint
paths in the network, we assume "uniform random" selection
at each processor; each path is selected randomly with equal
probability. We also assume that destinations are able to
accept more than one request simultaneously, i.e., the
memory modules are considered as multi-ported memory
units which can be accessed through more than one port at
the same time.

Figure 2. Ratios of MTTF for some fault-tolerant networks
with respect to that of Omega network. (N: network size;
R = MTTF of a fault-tolerant network / MTTF of Omega network;
dashed lines: lower bound; solid lines: upper bound. All the
networks are of 2×2 switches.)

Figure 3. Comparison of PA of Unbuffered Networks. (All the
multistage networks are of 2×2 switches.)


Following the analysis of the Delta network by Patel [10],
the PA's of the RSI-2 network and RSI-4 network of size N from
16 to 1024 were computed. A typical comparison of the PA's of
RSI networks with those of the Omega network and crossbar
network is shown in Figure 3. We find that with an
increased value of m, an RSI network may provide higher
bandwidth and probability of acceptance than those of an
Omega network of the same size, which is the result of having
multiple ports at each memory module. As one of the
paths becomes faulty, the number of paths a processor can
utilize decreases. Assuming that a faulty subnetwork of the
RSI network is totally unusable, we considered the
performance degradation with at most m-1 faulty subnetworks.
Figure 4 shows the probability of acceptance of an
unbuffered RSI-4 network with the number of faulty
subnetworks from 0 to 3, when the request generation rate p
equals 0.5 and 0.1, respectively. From this figure, we can see
that RSI networks can achieve graceful degradation of
performance and that the performance degradation will not be
significant when the traffic is "light." The performance study
of buffered RSI networks has also been carried out, and the
results are similar to the case of unbuffered RSI networks.


5. Cost-Effectiveness 


For an interconnection network designed for large-scale
general-purpose parallel computers, one of the important
considerations is its cost-effectiveness. If the high
performance and reliability of a network come at the expense of
too high a cost, it may have little value in practice.

Figure 4. Probability of Acceptance of an Unbuffered
RSI-4 with Faulty Subnetworks.

To assess cost-effectiveness, we first need to determine the
cost of the networks. To estimate the cost of a network of
size N, one common method is to calculate the switch
complexity with the assumption that the cost of a switch is
proportional to the number of gates involved, which is roughly
proportional to the number of "crosspoints" within a switch
[10, 13]. For example, a 2×2 switch has 4 units of hardware
cost whereas an m×1 multiplexer has m-1 units. In this
way, Omega, 3-replicated, and INDRA (R = 2) networks of 2×2
switches and the crossbar network have costs of
$2N\log_2 N$, $6N\log_2 N$, $4N(\log_2 N+1)$ and $N^2$,
respectively. Also, an RSI-m network has the cost of
$2N\left(\log_2\frac{N}{m}+m-1\right)$, and an ESC
network has the cost of $2N(\log_2 N+2)$.
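The crosspoint cost formulas quoted above are easy to tabulate. The sketch below simply encodes them for network sizes that are powers of 2; the function names are illustrative choices made here.

```python
def log2_exact(x):
    """Base-2 logarithm of a power of two, as an integer."""
    return x.bit_length() - 1


def omega_cost(n):
    return 2 * n * log2_exact(n)


def replicated3_cost(n):
    return 6 * n * log2_exact(n)


def indra2_cost(n):
    return 4 * n * (log2_exact(n) + 1)


def crossbar_cost(n):
    return n * n


def esc_cost(n):
    return 2 * n * (log2_exact(n) + 2)


def rsi_cost(n, m):
    # 2N*log2(N/m) crosspoints in the 2x2 switches plus
    # 2N*(m - 1) crosspoints in the multiplexers and demultiplexers.
    return 2 * n * (log2_exact(n // m) + m - 1)
```

For m = 2 the RSI cost reduces to $2N\log_2 N$, the cost of an Omega network of the same size; for example, `rsi_cost(16, 2)` and `omega_cost(16)` both evaluate to 128, while `rsi_cost(16, 4)` is 160.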


Now, a simple measure of the cost-effectiveness for
reliability can be given by comparing the MTTFs of the networks
with respect to the cost. Let the cost-effectiveness, $\eta$, of
a network for reliability be the ratio of its MTTF to its cost.
The cost-effectiveness $\eta$ of some fault-tolerant networks
relative to that of the Omega network (for both upper bounds and
lower bounds) is shown in Figure 5. In the same way, a
simple measure of the cost-effectiveness for performance can
be given by comparing the probability of acceptance of the
networks with respect to the costs. Figure 6 shows the
cost-effectiveness of RSI networks and crossbar networks for
performance relative to that of the Omega network. We can see
that the advantages of RSI networks in reliability and in
performance come at a modest cost.


6. Conclusion 


We proposed and analyzed a class of fault tolerant mul- 
tistage interconnection networks, named Reduced Size Inter- 
connection (RSI). By providing m identical subnetworks of 
size N/m for a RSI-m network of size N, we can achieve 
significant reliability gain and good performance at the same 
cost for constructing a unique path multistage network of the 
same size. 

The performance analyses of unbuffered and buffered
RSI networks have been carried out. We considered the
performance in the fault-free situation and in the presence of
faults. The results showed that, compared to a unique path
multistage network, an RSI network improves reliability
without decreasing performance. Also, the performance
degradation due to faulty subnetworks can be insignificant
for "light" traffic.

Figure 6. Ratios of PA/Cost for some unbuffered networks
with respect to that of Omega network.

To show the cost-effectiveness of RSI networks,
comparisons with other networks for reliability and for
performance were made. The results indicated that RSI networks
compare favorably with other fault-tolerant networks
such as ESC, INDRA (R = 2), and the 3-replicated network.

Abstract


This paper considers a mesh with reconfigurable bus (re- 
configurable mesh), that consists of a VLSI array of processors 
connected to a reconfigurable bus system. The N PEs are 
laid out as a square mesh in O(N) VLSI area. The recon- 
figuration scheme can be used to dynamically obtain various 
interconnection patterns between the PEs. In fact, the ar- 
ray can be used as a universal chip capable of simulating any 
O(N) area organization with a planar wiring layout without 
loss in time. The reconfiguration scheme also supports several 
parallel techniques developed for the CRCW PRAM. In this 
paper, we develop fundamental data movement operations for 
the reconfigurable mesh. These operations are used to give 
efficient solutions to a variety of problems involving graphs 
and digitized pictures. The running times of these algorithms 
are asymptotically superior to those developed for the mesh 
with multiple broadcasting, the mesh with multiple buses, the 
mesh-of-trees, and the pyramid computer. 


1 Introduction 


In this paper, we consider a reconfigurable VLSI array of processing 
elements that combines the advantages of a number of architectures 
including the mesh, pyramid, mesh-of-trees, and meshes with broad- 
cast buses. Due to page limitations, this paper will only summarize 
a subset of the results that we have obtained for the reconfigurable 
mesh. The reader is referred to [7] for discussions and algorithms 
associated with the results given in this paper. 

The mesh with reconfigurable bus (reconfigurable mesh) of size N
consists of an $N^{1/2} \times N^{1/2}$ array of processors connected to a
grid-shaped reconfigurable broadcast bus, where each processor has four
locally controllable bus switches, as shown in Figure 1. Other than the
buses and switches, the reconfigurable mesh is similar to the standard
mesh in that it operates in SIMD mode and has O(N) area, under
the assumption that processors, switches, and individual links have
constant size. In one unit of time each processor can perform standard
arithmetic and boolean operations on its own data, can set any of its
four switches, and can send and receive a piece of data from the bus.


Figure 1: A reconfigurable mesh of size 16. (Legend: lines denote the
reconfigurable bus; circled marks denote switches.)

In each subbus shared by multiple processors, at any given time
we assume that at most one processor may use the bus to broadcast
a value, where a value consists of O(log N) bits. Notice that by
setting the switches properly, sub-row (column) buses can be created
within each row (column), sub-meshes with reconfigurable buses can
be created, a global broadcast bus can be created, distinct buses can
be created within distinct sets of contiguously labeled processors, and
so forth.

Major advantages of the reconfigurable mesh are as follows. 


1. Buses can be used to speed up parallel arithmetic and logic
operations among data stored in different processors. The
reconfiguration scheme supports several CRCW PRAM techniques.
In fact, for some problems the reconfigurable mesh is superior
to the PRAM.


2. The reconfigurable mesh provides an environment for efficient 
sparse data movement operations. 


3. A significant asymptotic improvement can be achieved in the 
running times of algorithms that solve several problems on the 
reconfigurable mesh compared to efficient algorithms for the 
mesh-of-trees, pyramid, and mesh with static broadcast buses. 


4. The reconfigurable mesh can act as a universal chip in that VLSI 
organizations with equivalent area and a planar wiring layout 
can be simulated without loss in time. 


Many of the algorithms for the reconfigurable mesh (cf. [7]) will
continually reconfigure the system by setting the switches to give the
desired substructures.


2 Related Architectures 


It should be noted that although there are similarities, the
reconfigurable mesh is very different from the CHiP project [15], the mesh
augmented with broadcast buses [1, 3, 12, 16], and the bus
automaton [4]. However, the reconfigurable mesh is similar to the
polymorphic-torus network [6], with the major difference being that in the
polymorphic-torus network there is an arbitrary crossbar in each
processor to control connections between the north, south, east, and west
bus ports. Finally, the reconfigurable mesh appears to be almost
identical to the latest version of the Content Addressable Array Parallel
Processor (CAAPP) [18], which was developed independently of the
reconfigurable mesh.


3 Data Movement Operations 


Data movement operations form the foundation of numerous algo- 
rithms for machines constructed as an interconnection of processors. 


Proposition 3.1 Given a set $S = \{a_i\}$ of N values, distributed one
per processor on a reconfigurable mesh of size N so that processor
$P_i$ contains $a_i$, $0 \le i \le N-1$, and a unit-time binary associative
operation $\otimes$, in $O(\log N)$ time the parallel prefix problem can be solved
so that each processor $P_i$ knows $a_0 \otimes a_1 \otimes \cdots \otimes a_i$.
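
As a point of reference for Proposition 3.1, the following is a minimal sketch of a recursive-doubling prefix computation, written sequentially in Python. It only illustrates why a logarithmic number of rounds suffices when every round is a unit-time step; it does not model the buses of the mesh, and the names `parallel_prefix` and `op` are illustrative assumptions, not notation from the paper.

```python
import operator


def parallel_prefix(values, op=operator.add):
    """Recursive-doubling prefix computation.

    After round r, slot i holds the combination of the (up to) 2**r
    inputs ending at position i. Returns the prefix list and the number
    of rounds, which is ceil(log2(n)) for n >= 1.
    """
    result = list(values)
    n = len(result)
    step = 1
    rounds = 0
    while step < n:
        prev = list(result)  # all slots read before any slot writes
        for i in range(step, n):
            # Left operand comes first, so op need not be commutative.
            result[i] = op(prev[i - step], prev[i])
        step *= 2
        rounds += 1
    return result, rounds


prefix, rounds = parallel_prefix([1, 2, 3, 4, 5])
print(prefix, rounds)
```

Each pass of the inner loop stands for one parallel step in which every processor combines its value with one fetched from a fixed distance away.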


We introduce a technique called bus splitting, in which processors 
exploit the ability to locally control the effective size of subbuses, to 
obtain the following Proposition. 


Proposition 3.2 Given a reconfigurable mesh of size N, in which
each processor stores a bit of data, the logical OR of the data in each
row (column), or the entire reconfigurable mesh, can be determined in
O(1) time.
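
One plausible reading of bus splitting for the row OR can be sketched as follows. This is an assumption made for illustration, not the authors' stated construction: every processor holding a 1 cuts the row bus on its east side and writes a 1 onto the subbus to its west, so each subbus has at most one writer. The westmost processor then hears a 1 exactly when some processor in the row holds a 1, and it broadcasts that result over the reconnected bus. The function name `row_or_by_bus_splitting` is hypothetical.

```python
def row_or_by_bus_splitting(bits):
    """Simulate a two-step row OR on a single row bus.

    Step 1: each processor with a 1 splits the bus just east of itself
            and writes 1 on the subbus to its west.
    Step 2: the bus is reconnected and processor 0 broadcasts what it heard.
    Returns the value every processor in the row ends up with.
    """
    n = len(bits)
    heard = [0] * n
    segment_start = 0
    for j, bit in enumerate(bits):
        if bit:
            # Processors segment_start..j share one subbus whose only
            # writer is processor j, sitting at its east end.
            for p in range(segment_start, j + 1):
                heard[p] = 1
            segment_start = j + 1
    # Processors east of the last 1 sit on a subbus with no writer.
    result = heard[0] if n else 0
    return [result] * n


print(row_or_by_bus_splitting([0, 0, 1, 0]))
```

Both steps use a fixed number of bus cycles regardless of the row length, which is the property the proposition states.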


The reconfigurable mesh can be superior to other parallel models 
for performing some computations. Consider, for example, computing 
the exclusive OR (EXOR) function of $N^{1/2}$ values stored in a row of
the mesh. Note that [5] has shown that the exclusive OR function 
cannot be computed in O(1) time on a PRAM using a polynomial 
number of processors. However, by exploiting the reconfigurability 
available, the EXOR function can be computed in O(1) time on the 
reconfigurable mesh. 


Proposition 3.3 Given a reconfigurable mesh of size N, in which 
each processor stores a bit of data, the exclusive OR (EXOR) of the 
$N^{1/2}$ data items stored in a row (column) can be computed in $\Theta(1)$ time.
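
A common way to picture a constant-time parity computation on a reconfigurable bus is a two-rail path: a signal enters on rail 0 at the west end, each column holding a 1 sets its switches so that the signal crosses to the other rail, and each column holding a 0 passes it straight through. The rail on which the signal leaves at the east end is the EXOR. The sketch below traces that path sequentially; it is an illustrative assumption rather than the proof used in the paper, and `row_parity_two_rail` is a hypothetical name.

```python
def row_parity_two_rail(bits):
    """Trace a signal through a two-rail bus configured from the bits.

    Returns (exit_rail, path), where path[j] is the rail occupied just
    before column j and path[-1] is the rail at the east end.
    """
    rail = 0
    path = [rail]
    for bit in bits:
        if bit:
            rail = 1 - rail  # a column holding 1 crosses the rails
        path.append(rail)
    return rail, path


parity, path = row_parity_two_rail([1, 0, 1, 1])
print(parity, path)
```

On the mesh all switches are set simultaneously and the signal traverses the configured bus in one bus cycle, whereas the loop here visits the columns one at a time.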


Lemma 3.4 Given a reconfigurable mesh of size N, suppose each
processor in a row (column) stores a bit of data $d_j$, $0 \le j \le N^{1/2}-1$.
Then, the computation of $F_j$ given by $F_j = \sum d_i$, $0 \le j \le N^{1/2}-1$,
can be performed in O(1) time.


Many parallel algorithms are designed to reduce data at interme- 
diate stages of the algorithm. It is, therefore, often useful to be able 
to efficiently perform fundamental operations on reduced sets of data. 


Proposition 3.5 Given a reconfigurable mesh of size N, in which 
no more than one processor in each column stores a data value, the 
minimum (maximum) of these $O(N^{1/2})$ data items can be determined
in O(1) time. 


By a somewhat more complicated sequence, Valiant’s PRAM al- 
gorithm for finding the maximum [17] can be simulated on a recon- 
figurable mesh to find the maximum of all N values, assuming they 
are stored one value per processor. 


Proposition 3.6 Given a set of data items S of size N stored one
per processor on a reconfigurable mesh of size N, the maximum value
of S can be determined in O(log log N) time. 


Proposition 3.7 Given a set of bits S of size N stored one per pro- 
cessor on a reconfigurable mesh of size N, the EXOR of all items in 
S can be computed in O(log log N) time. 


Proposition 3.8 Suppose on a reconfigurable mesh of size N each 
processor has a label from a set of k distinct labels, $1 \le k \le N$.
Further, suppose processors having the same label form arbitrary con-
tiguous regions on the mesh. Then, each processor can know whether
there is at least one tagged processor with its label in O(1) time. Also,
given that there is at least one tagged processor for each label and all 
tagged processors with the same label store identical data, the tagged 
data can be broadcast to all other processors with the same label in 
the above time, for all labels in parallel. 


It is often desirable to model PRAM algorithms on other machines.
In order to efficiently simulate the CRCW PRAM, one must
be able to efficiently simulate the concurrent read and concurrent 
write properties. Define a Random Access Read (RAR) to be a data 
movement operation that models a concurrent read, in which each 
processor knows the index of another processor from which it wants 
to read data [11]. Similarly, a Random Access Write (RAW) will 
model a concurrent write in that each processor knows the index of a 
processor that it wishes to write to [11]. In case of multiple writes to 
the same processor, a tie-breaking scheme is used, such as minimum 
or maximum data value, or arbitrarily letting one value succeed. 
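
The semantics of the two operations can be stated compactly in code. The sketch below is a sequential specification only, under the assumption that processors are indexed 0..n-1 and that `None` marks a processor that does not take part; it says nothing about how the mesh achieves the stated running times. The function names and the `tie_break` parameter are hypothetical.

```python
def random_access_read(source_index, data):
    """RAR: processor i obtains data[source_index[i]] (None = no request)."""
    return [None if s is None else data[s] for s in source_index]


def random_access_write(dest_index, data, n, tie_break=min):
    """RAW: processor i sends data[i] to processor dest_index[i].

    When several values arrive at one processor, tie_break picks the
    survivor (min here; max or an arbitrary choice are equally valid).
    """
    incoming = {}
    for src, target in enumerate(dest_index):
        if target is not None:
            incoming.setdefault(target, []).append(data[src])
    memory = [None] * n
    for target, values in incoming.items():
        memory[target] = tie_break(values)
    return memory


data = [5, 3, 9, 7]
print(random_access_read([1, 1, None, 0], data))
print(random_access_write([2, 2, None, 0], data, n=4))
```

Passing `tie_break=max` gives the maximum-value rule mentioned in the text.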


Proposition 3.9 Given a reconfigurable mesh of size N, in $O(k^{1/2} +
\log N)$ time k data items may be moved in a RAR or RAW, where
$k \le N$.


In fact, more efficient data movement can be performed if the dis- 
tribution of the source processors, i.e., those processors sending data, 
as well as the destination processors, i.e., those processors receiving 
data, is uniform over the reconfigurable mesh. 


Proposition 3.10 Given a reconfigurable mesh of size N, if the num-
ber of source and destination processors within any block of size $k^2$
is $O(k)$, $1 \le k \le N^{1/2}$, then RAR and RAW can be performed in
$O(\log N)$ time.


Another fundamental operation that involves data movement is 
data reduction. Assume that each processor has at most one record 
having a key field and a data field. Data reduction will perform an 
associative binary operation on the data of records having the same 
key. At the end of the data reduction operation, each processor with 
key k will have the result of the binary operation performed over all 
data with key k. 
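
The input and output of data reduction can be pinned down with a short sequential specification. This is a sketch of what the operation computes, assuming each processor holds either one `(key, data)` record or `None`; it is not the mesh algorithm, and `data_reduction` is a hypothetical name.

```python
import operator


def data_reduction(records, op=operator.add):
    """Combine the data of all records sharing a key.

    records[i] is (key, data) or None for processor i. Processor i ends
    with op applied over every datum whose key equals its own key, taken
    in processor order.
    """
    totals = {}
    for record in records:
        if record is None:
            continue
        key, value = record
        totals[key] = op(totals[key], value) if key in totals else value
    return [None if r is None else totals[r[0]] for r in records]


print(data_reduction([("a", 1), ("b", 2), None, ("a", 3)]))
```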


Proposition 3.11 Given a binary associative operator $\otimes$, data re-
duction can be performed on k distinct keys in $O(k^{1/2} + \log N)$ time
on a reconfigurable mesh of size N, so that each processor knows the
result of applying $\otimes$ over all data items with its key.


Lemma 3.12 Given a reconfigurable mesh of size N with k distinct
keys randomly distributed one key per processor, the number of dis-
tinct keys can be determined in $O(k^{1/2} + \log N)$ time.


4 Applications


In this section, we illustrate the performance of the reconfigurable 
mesh by giving simulations of other low wire area organizations, such 
as the mesh-of-trees and pyramid, discussing the use of the recon- 
figurable mesh as a universal chip, and by giving efficient parallel 
algorithms to solve problems involving graphs and images. 


4.1 Simulations 


Well-known organizations such as the mesh-of-trees and pyramid
computer can be efficiently simulated by the reconfigurable mesh due to
the numerous communications patterns that the reconfigurable mesh 
provides. In the first part of this section, we consider step by step 
simulation of the mesh-of-trees and pyramid. 

A mesh-of-trees (MOT) of base size N, where N is an integral
power of 4, has a total of $3N - 2N^{1/2}$ processors. N of these are base
processors arranged as a mesh of size N. Above each row and above
each column of the mesh is a perfect binary tree of processors. Each
row (column) tree has as its leaves an entire row (column) of base
processors. All row trees are disjoint, as are all column trees. Every
row tree has exactly one leaf processor in common with each column tree.
Each base processor is connected to 6 other processors (assuming they
exist): 4 neighbors in the base, a parent in its row tree, and a parent
in its column tree. Each processor in a row or column tree that is
neither a leaf nor a root is connected to exactly 3 other processors in
its tree: a parent and 2 children. Each root in a row or column tree is
connected to its 2 children. Notice that in the MOT the processors in
each row and in each column can be looked upon as placed at levels
$0, 1, \ldots, k$, where $N^{1/2} = 2^k$.
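
The processor count follows directly from the structure just described: N base processors, plus $N^{1/2}$ row trees and $N^{1/2}$ column trees, each contributing $N^{1/2} - 1$ non-leaf processors. A small sketch that checks the arithmetic (the function name is a hypothetical choice made here):

```python
import math


def mesh_of_trees_processor_count(n):
    """Count processors of a mesh-of-trees whose base is a mesh of size n."""
    side = math.isqrt(n)
    # n must be a power of 4, i.e. side*side == n and side a power of 2.
    if n < 1 or side * side != n or side & (side - 1):
        raise ValueError("base size must be an integral power of 4")
    non_leaf_per_tree = side - 1  # a perfect binary tree over `side` leaves
    trees = 2 * side              # one tree per row and one per column
    return n + trees * non_leaf_per_tree


for n in (4, 16, 64):
    side = math.isqrt(n)
    print(n, mesh_of_trees_processor_count(n), 3 * n - 2 * side)
```

The two printed counts agree for each base size, since $N + 2N^{1/2}(N^{1/2} - 1) = 3N - 2N^{1/2}$.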

Define a c-embedding of a hierarchical organization onto the re-
configurable mesh to have the following properties.


1. A constant number of processors of the hierarchical organization 
are mapped to each processor of the reconfigurable mesh. 


2. The number of communication links between levels $l$ and $l+1$,
$0 \le l \le k-1$, incident on any row or column bus segment is
$\le c$.


Define a class of algorithms on a hierarchical organization to be nor- 
malized algorithms if the following hold. 


1. During a computation step of a hierarchical algorithm, all data 
operated on are located at the same level of the hierarchical 
organization. 


2. During a communication step of a hierarchical algorithm, com- 
munication is performed between at most two adjacent levels of 
the hierarchical organization. 


[9] shows how to embed the mesh-of-trees into a mesh. This em- 
bedding is used to embed the mesh-of-trees into the reconfigurable 
mesh and obtain the following two propositions. 


Proposition 4.1 Any normalized algorithm running in T(N) time 
on a mesh-of-trees of base size N can be simulated on a reconfigurable
mesh of size N to finish in O(T(N)) time. 


Proposition 4.2 Any algorithm running in T(N) time on a mesh-
of-trees of base size N can be simulated on a reconfigurable mesh of
size N to finish in O(T(N) log N) time. Further, this time is optimal.


We now turn our attention to the simulation of the reconfigurable 
mesh by the mesh-of-trees. During the execution of an algorithm 
on the reconfigurable mesh the buses are continuously configured. A 
configuration of the bus corresponds to partitioning the mesh into 
disjoint sets of contiguous processors. Reconfiguration of the bus can 
be simulated on the mesh-of-trees by identifying contiguous processors 
in the mesh. This reduces to the problem of identifying connected 1’s 
in an N1/2 x N1/2 digitized image (see Section 5.2 for more details). 


On the mesh-of-trees this can be done in O( 2) time [9, 10]. 


Proposition 4.3 A mesh-of-trees of base size N can simulate a re-
configurable mesh of size N in $O\!\left(\frac{\log^3 N}{\log\log N}\right)$ time if the switch set-
tings are dynamic, or $O\!\left(\frac{\log^2 N}{\log\log N} + \log^2 N\right)$ if the switch settings
are static.


We now turn our attention to relationships between the reconfig-
urable mesh and the pyramid computer. A pyramid computer (pyra-
mid) of size N is a machine that can be viewed as a full, rooted,
4-ary tree of height $\log_4 N$, with additional horizontal links so that
each horizontal level is a mesh. It is often convenient to view the
pyramid as a tapering array of meshes. A pyramid of size N has at
its base a mesh of size N, and a total of $\frac{4}{3}N - \frac{1}{3}$ processors. The levels
are numbered so that the base is level 0 and the apex is level $\log_4 N$.
A processor at level $i$ is connected via bidirectional unit-time com-
munication links to its 9 neighbors (assuming they exist): 4 siblings
at level $i$, 4 children at level $i-1$, and a parent at level $i+1$.
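
Since each level of a pyramid is a mesh one quarter the size of the level below it, the level sizes of a pyramid with base size N are N, N/4, ..., 4, 1, and their sum is $(4N - 1)/3$. The sketch below checks this arithmetic; the function name is a hypothetical choice made for illustration.

```python
def pyramid_level_sizes(n):
    """Return the mesh size of every level, base first, for base size n."""
    if n < 1:
        raise ValueError("base size must be positive")
    sizes = []
    size = n
    while size > 1:
        if size % 4:
            raise ValueError("base size must be an integral power of 4")
        sizes.append(size)
        size //= 4
    sizes.append(1)  # the apex
    return sizes


for n in (1, 4, 16, 64):
    sizes = pyramid_level_sizes(n)
    print(n, sizes, sum(sizes), (4 * n - 1) // 3)
```

The number of levels returned is $\log_4 N + 1$, matching the numbering of levels from 0 at the base to $\log_4 N$ at the apex.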

An embedding of the pyramid into the reconfigurable mesh, simi- 
lar to the mesh-of-trees embedding used in Propositions 4.1 and 4.2, 
is used to give the following. 


Proposition 4.4 Any algorithm running in time T(N) on a pyramid 
of size N can be simulated on a reconfigurable mesh of size N in 
O(T(N)) time. 


The $\Theta(N^{1/4})$ time solution to the connected 1’s problem on a
pyramid of size N [8] is used to give the following. 


Proposition 4.5 Any algorithm running on the reconfigurable mesh 
of size N in time T(N) can be simulated on a pyramid of size N in 
$O(T(N) N^{1/4})$ time. This simulation is optimal.


Theorem 4.1 Any architecture that can be laid out in an $N^{1/2} \times
N^{1/2}$ grid and uses a planar wiring (assuming wires have unit width)
can be simulated by the reconfigurable mesh in constant time per unit
time of the target architecture.


4.2 Graph Problems 


The first problem considered in this section is that of computing the
connected components of an undirected graph with $N^{1/2}$ vertices,
given as an adjacency matrix. The $(i, j)^{th}$ entry of the adjacency
matrix of the graph is initially stored in processor $P_{i,j}$ of the recon-
figurable mesh. The algorithm that we use is based on the O(log V)
time algorithm presented for the CRCW PRAM [14].


Theorem 4.2 Given the adjacency matrix of an undirected graph
with $N^{1/2}$ vertices distributed so that the $(i, j)^{th}$ element of the matrix
is stored in processor $P_{i,j}$ of a reconfigurable mesh of size N, the
connected components of the graph can be determined in O(log N)
time.


The reconfigurable mesh can also be used to provide efficient solu- 
tions to some graph problems that assume unordered edges as input. 


Theorem 4.3 The connected components of a V vertex graph given
in unordered edge input format can be computed in $O(V^{1/2})$ time on
the reconfigurable mesh of size N, where $N^{1/2} \le V \le N$.


Corollary 4.6 A minimal spanning forest of a V vertex graph given
in unordered edge input format can be computed in $O(V^{1/2})$ time on
the reconfigurable mesh of size N, where $N^{1/2} \le V \le N$.


Several graph properties can be deduced once a spanning tree 
of the graph is determined [2]. Using Corollary 4.6 and the data 
movement operations presented in Section 3, the following results 
can be obtained. 


Corollary 4.7 Given N edges of a graph G with V vertices dis-
tributed one vertex per processor in a reconfigurable mesh of size N,
$N^{1/2} \le V \le N$, in $O(V^{1/2})$ time, one can

a) check if G is bipartite,

b) compute the cyclic index of G, and

c) compute the articulation points of G.


4.3 Image Problems 


Many problems involving digitized images can be solved efficiently on 
the reconfigurable mesh. The input to these problems is an $N^{1/2} \times
N^{1/2}$ digitized image distributed one pixel per processor on a recon-
figurable mesh of size N so that processor $P_{i,j}$ has pixel $(i, j)$. The
problems that we examine focus on labeling figures (connected compo- 
nents) and determining properties of the figures. The reconfigurable 
bus is used to isolate individual figures so as to be able to efficiently 
extract information concerning multiple figures in a digitized image. 
A subbus is created and dedicated to keep track of all deliberations 
with respect to each figure. 


Theorem 4.4 Given an $N^{1/2} \times N^{1/2}$ digitized image mapped one
pixel per processor onto the processors of a reconfigurable mesh of
size N in a natural fashion, in O(log N) time the figures (connected 
components) can be labeled. 


Theorem 4.5 Given an $N^{1/2} \times N^{1/2}$ digitized image mapped one
pixel per processor onto the processors of a reconfigurable mesh of
size N in a natural fashion, in O(log N) time a closest figure to each
figure can be determined.


Theorem 4.6 Given an $N^{1/2} \times N^{1/2}$ digitized image mapped one
pixel per processor onto the processors of a reconfigurable mesh of
size N in a natural fashion, in $O(\log^2 N)$ time the extreme points of
the convex hull can be enumerated for every figure.


Theorem 4.7 Given an $N^{1/2} \times N^{1/2}$ digitized image mapped one
pixel per processor onto the processors of a reconfigurable mesh of
size N in a natural fashion, in O(1) time several geometric proper-
ties of a set S of pixels can be determined. These properties include
marking and enumerating the extreme points of the convex hull of the
points, determining the diameter of the points, determining a small-
est enclosing box of the points, and determining a smallest enclosing
circle of the points.


5 Conclusion 


This paper considers the reconfigurable mesh as a viable alternative 
to a variety of processor organizations. We have presented efficient 
implementations of fundamental data movement operations for the 
reconfigurable mesh and have shown that it can be used as a uni- 
versal chip, in that the reconfigurable mesh is capable of simulating 
any organization of processors occupying the same area and using a 
planar wiring layout without loss of time. We have also presented al- 
gorithms that show how the reconfigurable mesh can efficiently solve 
a number of graph and image problems using the fundamental data 
movement operations. The running times of these algorithms are 
asymptotically superior to running times of solutions for the mesh 
with multiple broadcasting, the mesh with multiple buses, the mesh- 
of-trees, and the pyramid computer. Further, we have shown that 
there are problems for which solutions on the reconfigurable mesh are 
more efficient than those possible for a PRAM. 


Abstract


A dataflow processor architecture is presented that achieves
high speed processing of vector data through a pipelining
technique. A new dataflow concept, the “Variable Length
Token”, is proposed for enhancing data processing capability
and flexibility.


A variable length token (VLT) is a token set consisting
of a specifiable quantity of fixed size tokens. Multiple tokens 
to be processed together form a VLT and flow so as to main- 
tain their consecutivity. In the proposed processor, a VLT is 
taken as a unit of processing both in firing control and in oper- 
ations. This technique reduces the inter-token synchronization 
and communication overhead common to conventional dataflow 
machines, and it facilitates the handling of composite data, such 
as multiple precision data, vector data, and structured data, in 
a static dataflow model. 


Also discussed is a system architecture with the multiple 
processor elements able to operate in parallel. System perfor- 
mance analysis results show that the system is particularly well 
suited to pattern processing. 


1. Introduction 


A data-driven computation model, in which the activation 
of an instruction execution is determined by the availability of 
its operands, can efficiently extract and exploit the concurrency 
inherent in computation [1]-[3]. Since the model was proposed,
several computers with dataflow architecture have been pro-
posed, designed, and put into operation in various fields [4]-[6].


Dynamic architecture is adopted in several machines, in-
cluding, for example, the University of Manchester’s machine
[7], the machine by MIT’s Arvind and his group [5], and the
SIGMA-1 by the Electrotechnical Laboratory in Japan [8].
They are intended to cover rather large scale computation with
large hardware systems, in some cases multiprocessor systems.


NEC’s TIP (Template-controlled Image Processor) project
[9][10] has designed and developed a static dataflow VLSI,
“ImPP” (µPD7281), which can achieve high speed processing
on a sequence of data through a pipeline approach and thus
is especially well-suited to image processing applications [11].
Instruction level parallelism is achieved by pipelining the exe-
cution of operations in ImPP. Accordingly, this fine-grain data-
driven parallelism enables efficient utilization of the processing
(functional) unit, which leads to increased performance for “ir-
regular” data as well as regular vector data. In an ImPP mul-
tiprocessor system, moreover, individual ImPPs, sequentially
connected in an array, can execute programs in parallel. The
system has consequently been able to prove its high speed pro-
cessing capability on a large amount of data in applications
such as image processing [12][13].


Pattern processing, on the other hand, deals with a huge
amount of two dimensional data and thus requires high speed
computation capability, which has so far been met only with
special and expensive hardware. Nevertheless, practical
systems for it have to be handy and offer high cost-performance.
The computations involved in such applications, furthermore,
may demand extreme flexibility and a high level of
programmability from the processors.


The pipelining technique, which permits concurrent operations,
is an effective approach to meet those demands and is widely 
used to attain high performance with comparatively small hard- 
ware. To assure that the pipeline hardware would be flexible 
and optimally utilized, data-driven control is an appropriate 
measure. 

In a conventional dataflow approach, however, and espe-
cially in static models, the unit of computation is always a single
data packet, known as a token; at most two tokens can be
processed at one time. To handle composite data,
such as multiple precision data, vector-type data, or structured
data, synchronization and communication among those com-
posite data tokens have to be programmed by combinations of
dyadic operations, the overhead of which leads to a large amount of
token flow traffic and thus degrades the performance.

This paper describes a dataflow pipeline processor architec-
ture employing the authors’ newly proposed “Variable Length
Token” technique to overcome the problems described above.
A variable length token (VLT) is a token set consisting of a
specifiable quantity of fixed size tokens; the proposed processor
“V-TIP” (Template-controlled Image Processor with VLT) deals
with it as a unit of computation. Tokens in a VLT always flow
consecutively in the V-TIP processor element and in the V-TIP
multiprocessor system, which enables composite operations.

In Section 2, the V-TIP processor architecture is explained 
in detail; Section 3 covers the concept, implementation, and re- 
sulting usefulness of the variable length token. The architecture 
for a dataflow processing system with this proposed processor is
presented in Section 4, which further explains how the parallel 
processing is implemented. Finally, the V-TIP system perfor- 
mance is discussed in Section 5. 


2. Advanced Dataflow Processor V-TIP 


Pipeline and Dataflow 

In pattern processing, such as image processing and pat- 
tern recognition, and in general purpose numerical computation, 
such as numerical simulations and solution of large systems of 
equations, vector data is the major element in the computations. 
To handle efficiently a large amount of uniform and sequential 
data like vectors, a pipeline technique is very convenient and is 
widely used in all kinds of high speed processors. Pipeline tech-
niques may be classified into two categories: one is instruction
level pipelining, in which data items flow through functioning
elements that are lined up in sequence and operate in parallel; 
the other is more fine-grained pipelining, in which each instruc- 
tion execution is partitioned into multiple stages of subfunctions 
that operate in parallel on each data item in a pipeline manner. 


With both of these approaches, however, though designs 
can be optimized for high level performance in specific appli- 
cations, the number and allocation of stages are fixed, which 
considerably limits flexibility. 

In the system proposed in this paper, in order to make 
pipelining more flexible and programmable, dataflow architec- 
ture has been adopted in the processor. In the dataflow ap- 
proach, a tag to identify the data accompanies the data itself 
through the pipeline. A single data item with its tag is called a 
token, the unit to be handled and processed in dataflow comput- 
ers. As Fig. 1(a) illustrates, a token in our architecture, when in 
the V-TIP processor element, comprises a Link Table Address, 
a flag called control flag, and a data value. A Module Number 
for specifying the processor of destination is affixed while the 
token is on the outer system bus (outer TIP bus) outside the 
processor elements (Fig. 1(b)). 


Figure 1. V-TIP token format.
(Control Flag field contains a “VLT Flag”.)
(a): when in a V-TIP processor element (fields: Link Table Address, Control Flag, Data)
(b): when on the outer TIP bus (fields: Module Number, Link Table Address, Control Flag, Data)


The Link Table Address is an address for accessing the
program table, the Link Table (LT), in the processor element. The
control flag specifies the class of the token (whether it is a pro-
gram load token, a status dump token, or an executable token)
and gives information about the VLT structure.
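
The two token layouts can be summarized as a pair of record types. The sketch below is an assumption-laden illustration: the field names are invented here, all fields are modeled as plain integers, and no bit widths or encodings are implied, since the text does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InnerToken:
    """Token as it circulates inside a V-TIP processor element."""
    link_table_address: int  # LTA: selects an entry in the Link Table
    control_flag: int        # token class plus VLT structure information
    data: int


@dataclass(frozen=True)
class BusToken:
    """Token as it travels on the outer TIP bus."""
    module_number: int       # MN: destination processor
    link_table_address: int
    control_flag: int
    data: int


def to_bus(token, module_number):
    """Affix a Module Number when a token leaves the processor element."""
    return BusToken(module_number, token.link_table_address,
                    token.control_flag, token.data)


def from_bus(bus_token):
    """Drop the Module Number when a token enters the processor element."""
    return InnerToken(bus_token.link_table_address,
                      bus_token.control_flag, bus_token.data)


inner = InnerToken(link_table_address=12, control_flag=0, data=255)
print(to_bus(inner, module_number=3))
```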


With this dataflow pipeline architecture, there is no need
for the processor to “fetch” instructions between executing
them, which allows all the pipeline stages to operate fully. The
processor will maintain maximum performance, so long as data
is constantly fed to the pipeline. Thus, vector data processing 
is especially efficient, since its flow is uniform and consecutive. 
This makes this particular architecture especially well-suited to 
pattern processing and scientific numerical computations that 
must handle large amounts of vector and matrix data. 


Overall Architecture 


Figure 2 shows the architecture of the proposed proces-
sor, V-TIP (Template-controlled Image Processor with Variable
length token).

As Fig. 2 shows, V-TIP is composed of multiple functional
modules which operate upon the flowing tokens completely in
parallel. Some of the modules, moreover, are partitioned into
several stages to achieve more concurrency and to speed up
the pipeline cycles.


Figure 2. V-TIP architecture.


A token flows along the outer TIP bus
and goes into the processor through a Bus Input Controller
(BIC). Tokens on the outer TIP bus each contain data to be
processed, a Module Number (MN) to identify their processor
of destination, a Link Table Address (LTA) to refer to the inner
program table in the destination processor, and a control flag
(CTLF). Part of the control flag, the “VLT flag field”, is also
used to distinguish the tokens in a VLT from other tokens. VLT
implementation and usage will be described later.


If the destination of the token, indicated by its MN field, 
matches the MN assigned to the V-TIP beforehand, the token 
is sent into its Link Table (LT). If the destination is not for this 
V-TIP, on the other hand, the token is sent directly back to 
the outer TIP bus through the Bus Output Controller (BOC) 
as a “pass token”. The Link Table maps the Link Table Ad-
dress (LTA) field of the incoming token to a next LTA and to
a Function Table (FT) access address. The token is sent to the 
FT through the Function Table Queue. 
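
The routing decision at the bus input and the Link Table lookup can be sketched as follows. The data representation is an assumption made for illustration: tokens are named tuples, the Link Table is a dictionary from an LTA to a pair (next LTA, Function Table address), and the returned strings merely name the next module. None of these identifiers come from the paper.

```python
from collections import namedtuple

BusToken = namedtuple("BusToken", "mn lta ctlf data")


def bus_input_controller(token, my_module_number):
    """Decide where a token arriving from the outer TIP bus goes next."""
    if token.mn == my_module_number:
        return "link_table"
    # Not addressed to this processor: forward it unchanged as a pass token.
    return "bus_output_controller"


def link_table_lookup(link_table, token):
    """Map the token's LTA to its next LTA and a Function Table address."""
    next_lta, ft_address = link_table[token.lta]
    return token._replace(lta=next_lta), ft_address


link_table = {5: (6, 40)}  # hypothetical contents
token = BusToken(mn=2, lta=5, ctlf=0, data=17)
if bus_input_controller(token, my_module_number=2) == "link_table":
    print(link_table_lookup(link_table, token))
```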


Firing Control and Operand Fetch 

In the Function Table (FT), the incoming token undergoes
firing control, wherein it is checked to determine whether the two
operands needed for the dyadic operation are both available,
and the waiting token, if it exists, is extracted from the Data
Memory (DM). The V-TIP employs a “queued architecture”, and
thus multiple tokens (or multiple VLTs) can wait in a first-
in-first-out (FIFO) buffer area allocated in the DM.


The firing control at the Function Table and Data Memory 
enables both inter-token synchronization and inter-VLT syn- 
chronization. A VLT can wait for the corresponding VLT and, 
when they match, they are sent to the Processing Unit together. 
By use of this matching mechanism, two VLTs, of lengths M and
N, can be synchronized and concatenated to construct a VLT
of length M + N. (Here, “a VLT of length L” means, naturally,
“a VLT consisting of L tokens.”)
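
The matching behavior can be illustrated with a small model of a dyadic node. This is a sketch under stated assumptions: a VLT is modeled as a Python list of data values, each node keeps one FIFO of operands waiting on a single side, and the class and method names are hypothetical. The real Function Table and Data Memory organization is not described at this level in the text.

```python
from collections import deque


class FiringControl:
    """FIFO matching of left and right operands for dyadic nodes."""

    def __init__(self):
        self.waiting = {}  # node id -> deque of (side, vlt)

    def arrive(self, node, side, vlt):
        """Register an operand; return (left, right) if the node fires."""
        queue = self.waiting.setdefault(node, deque())
        if queue and queue[0][0] != side:
            _, partner = queue.popleft()  # oldest waiting partner
            return (partner, vlt) if side == "right" else (vlt, partner)
        queue.append((side, vlt))         # no partner yet, so wait
        return None


def concatenate(left, right):
    """Join a VLT of length M and one of length N into one of length M + N."""
    return list(left) + list(right)


fc = FiringControl()
fc.arrive("add1", "left", [1, 2])
pair = fc.arrive("add1", "right", [9, 9, 9])
print(pair, len(concatenate(*pair)))
```

Because a differing side always removes the oldest entry, every entry waiting at a node belongs to the same side, which is what lets a single FIFO per node suffice in this model.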


Token generation is needed in the Function Table, in case 
a longer VLT than the input token is needed as a result of 
firing control, as in the above-mentioned VLT concatenation for 
example. While the generation is carried out in the Function 
Table, a busy signal from the Function Table keeps the Function
Table Queue from sending successive tokens to the Function Table.


When the token pair is fired in the Function Table, it then 
fetches a PU instruction code there and is sent to the Processing
Unit Queue (PUQ). The PUQ serves as a token buffer and has 
tokens wait till the Processing Unit accepts the next token in- 
put. This buffering is indispensable to smooth irregularities of 
processing and thus to attain effective utilization of the process- 
ing power. 


Processing Unit Architecture 

The Processing Unit (PU) consists of a Multiplier Proces-
sor (MLP), an ALU Processor (ALUP) and a Token Formator
(TF), which are connected sequentially and operate in a pipeline
manner.

The Multiplier Processor (MLP) contains a multiplier and 
executes one-word by one-word multiplication, bit shift and bit
rotate operations on fixed point format data. An adder for the 
exponent parts of floating point format data is also provided to 
carry out floating point data multiplication. 

The ALU Processor (ALUP) performs arithmetic and log-
ical operations. The ALUP has registers and carries out status-
dependent processing using them. A register is used, for
example, to accumulate values or to detect the minimum value
in a sequence of tokens.


The Token Formator (TF) assembles the result token
from the ALUP output, normalizing floating point data where
needed. The LTA field of the result token, which indicates its
destination, can be modified according to the result of the PU
operation, indicated by the status flags of the ALU, causing the
token to branch conditionally to other destinations.

The Processing Unit can also be used to generate multiple 
tokens out of one given token. It is used mainly to: 

1) Copy and distribute result data to different destinations. 
2) Make a sequence of tokens (like tokens with data ‘0’, ‘1’,
‘2’, ..., ‘15’), when given the initial data 0, the difference
1, and the length of the sequence 16.
3) Generate VLTs out of given tokens, which is accomplished 
by modifying the VLT-flag field for each token. 
When multiple tokens are being generated, the Processing Unit 
sends a busy signal to the Processing Unit Queue; accordingly, 
the Processing Unit Queue stops its output to the Processing 
Unit. 
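
The first two uses of token generation can be written down directly. The sketch below models a token as an `(lta, data)` pair, which is an assumption made for brevity; the function names are hypothetical and no claim is made about how the Processing Unit implements either operation.

```python
def generate_sequence(initial, difference, length):
    """Data values of a token sequence built from (initial, difference, length)."""
    return [initial + k * difference for k in range(length)]


def copy_to_destinations(data, destinations):
    """One result value copied into a token for each destination LTA."""
    return [(lta, data) for lta in destinations]


print(generate_sequence(0, 1, 16))
print(copy_to_destinations(42, [3, 7]))
```

The first call reproduces the example in the text: initial data 0, difference 1, and length 16 give the data values 0 through 15.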

After a set of operations is applied to a token in the Pro- 
cessing Unit, the PU output token is sent either to the Link Ta- 
ble or to the Bus Token Table, depending on the code. If more 
operations are to be performed on the token in this processor, it 
is accepted by the LT, goes on to the next operation, and keeps 
on going around the Inner V-TIP Ring (circular pipeline), which 
consists of the LT, FTQ, FT, DM, PUQ and PU, until finally 
outputting to the outer TIP bus. 


Token Output 

If the token from the Processing Unit is to be sent to the 
outer TIP bus, it is passed on to the Bus Token Table (BTT).
In the BTT, the token accesses the table and fetches a set of
identifiers, a Module Number (MN) and a Link Table Address
(LTA), that are needed for tokens on the outer bus. The token
is then sent to the Bus Output Queue (BOQ), where it is kept 
waiting while the outer TIP bus is busy, and to the outer TIP 
bus through the Bus Output Controller (BOC). In the BOC, 
the pass token flow from the Bus Input Controller and the out- 
put token flow from the Bus Output Queue are controlled and 
merged. 


3. Variable Length Token


VLT Concept 

In conventional static dataflow machines, the unit of com- 
putation is always one token. Therefore, to handle composite 
data, such as multiple precision data, vector data and structured 
data, it is necessary to program explicitly the synchronization 
and communication among those data tokens by combinations 
of dyadic operations. The reduction in the effective utilization
of processing power, due to this synchronization overhead, is 
one of the biggest problems in fine-grained dataflow machines. 

When adding two double precision values with fixed point 
format using conventional dataflow machines like the µPD7281,
for example, a flow graph, such as that illustrated in Fig. 3, is 
needed to describe the program. This flow graph shows that the 
token has to go around the inner ring of the processor sequen- 
tially twice. Moreover, the lower word processing and higher 
word processing should be programmed separately, considering 
synchronization between them. 


Figure 3. Dataflow graph for double-precision
data addition in a conventional dataflow machine.


The variable length token technique, newly proposed here, 
provides a solution to these problems. A variable length token 
(VLT) is a token set consisting of a specifiable quantity of fixed 
size tokens. The tokens in a VLT always flow consecutively in 
the V-TIP processor element and in the system using the pro- 
cessor. The VLT is taken as a unit of processing both in firing 
control at the Function Table and in operations at the Pro- 
cessing Unit just as a single token is considered in conventional 
dataflow machines. 

The VLT technique enables flexible flow control and en- 
ables high speed processing of composite data structures in a 
static dataflow model. It provides the following three major ad- 
vantages, details of which will be explained in the next section: 

a. Overhead reduction
To reduce the communication overhead between tokens,
the VLT technique provides a way to increase data granu-
larity by concatenating relevant data sets together.

b. Functionality in vector operations
The functionality of vector operations can be assured by
the consecutivity of the data items involved, which are
chained in a VLT. Vector data can, therefore, be efficiently
handled with the help of registers, in the case of data sequence
accumulation, for example.


c. Affixation of control information 
Appending an index or a relevant address by means of VLT 
allows the control information to go with the data token, 
when the token changes its path depending on its data 
value. This technique facilitates the re-ordering for token 
streams when they are to be merged after branching. 


VLT Implementation 

A token has a VLT flag for VLT identification, which is a 
part of the control flag field, besides a data value and a Link 
Table Address. Tokens in a VLT, though they may have dif- 
ferent VLT flags, have the same Link Table Address and are 
destined to the same node in the dataflow graph. The VLT flag 
of a token indicates whether: 
a) It is the last (tail) token for a VLT or it is a single-token- 
VLT. 
b) It is in the midst of a VLT, so that the next tokens must 
be handled consecutively in regard to the previous token. 
c) The token has a special meaning, such as an “index” token, 
which will be explained later. 


VLT consecutivity is assured by controlling the merging 
of the two token streams using the information included in the 
VLT flag of the tokens involved. For example, in the Link Table 
(LT), where tokens from the Bus Input Controller (BIC) and 
from the Processing Unit (PU) merge, the LT checks the VLT 
flag of the token currently being sent to the Function Table 
Queue, then determines from which direction it should accept 
the next token. If the flag indicates the token is not the tail 
of the VLT, the next token must be accepted from the same 
direction as the previous one, since the two belong to the same 
VLT and the sequence may not be broken. If it is the tail token, 
on the other hand, then the direction of the next token may 
be determined according to the priority rule, as there are no 
other restrictions. The same control scheme is employed at the 
Bus Output Controller, which accepts tokens from Bus Output 
Queue and Bus Input Controller. 
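
The merge rule described above can be sketched in a few lines. This is a minimal model and not the V-TIP hardware: the `Token` class, the `merge_streams` function, and the alternating priority rule are assumptions made for illustration (the text only states that some priority rule applies once a tail token has passed).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    data: int
    tail: bool  # True for the last token of a VLT (or a single-token VLT)


def merge_streams(bic, pu):
    """Merge two token streams without ever splitting a VLT."""
    queues = {"bic": deque(bic), "pu": deque(pu)}
    other = {"bic": "pu", "pu": "bic"}
    preferred = "pu"   # assumed priority rule: alternate after each tail
    locked = None      # source that must be read while inside a VLT
    out = []
    while queues["bic"] or queues["pu"]:
        if locked is not None:
            src = locked                 # stay on the same direction
        elif queues[preferred]:
            src = preferred              # free choice: use the priority rule
        else:
            src = other[preferred]
        if not queues[src]:
            # Real hardware would stall here until the rest of the VLT arrives.
            raise ValueError("stream ended in the middle of a VLT")
        tok = queues[src].popleft()
        out.append(tok)
        if tok.tail:
            locked = None
            preferred = other[src]
        else:
            locked = src
    return out
```

With a two-token VLT on one input and a single token followed by a two-token VLT on the other, the merged stream interleaves whole VLTs but never the tokens inside one.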


In the Function Table (FT) and Data Memory (DM), as 
has been explained in Section 2, the firing (enabling) control 
for tokens is carried out. In token synchronizations by firing 
control, a VLT is regarded as one token. When two VLTs are 
to be synchronized, the one that comes to the FT first waits 
in DM till the other comes. The two VLTs then match each 
other in the FT and go to the Processing Unit together. As is 
shown in Fig. 4, the Queue-Concatenate operation allows two 
VLTs, of length M and length N, to synchronize, and then to 
construct a VLT of length M + N by concatenation. 

Figure 4. Synchronization and concatenation 
of two VLTs. 
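
As an illustration of the Queue-Concatenate behaviour, the following sketch models the Data Memory as a dictionary keyed by the destination node. The function name, the `(data, is_tail)` representation, and the choice to place the first-arrived VLT in front are assumptions; the text does not specify the operand order.

```python
def queue_concatenate(data_memory, node, vlt):
    """Synchronize two VLTs at `node` and concatenate them.

    A VLT is a list of (data, is_tail) pairs; only the last pair is a tail.
    Returns None while the first VLT waits, otherwise the joined VLT.
    """
    if node not in data_memory:
        data_memory[node] = list(vlt)   # first arrival waits in Data Memory
        return None
    first = data_memory.pop(node)
    # Clear the old tail flag so the result has exactly one tail token.
    return [(data, False) for data, _ in first] + list(vlt)
```

A VLT of length M followed by one of length N yields a single VLT of length M + N whose only tail is the last token.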




VLT: Use and Effectiveness 


The following presents some examples of the use and ef- 
fectiveness of the VLT technique in handling composite data 
in various kinds of processing, in comparison with conventional 
techniques. 


i) Multiple Precision Data Multiple precision data, like 
double precision or quad precision data, can be represented us- 
ing VLTs. In the case of double precision data, a VLT consisting 
of two tokens, one token with the lower word data at the front 
and one with the higher word data in the back, represents a data 
item (See Fig. 5 (a)). When two double precision data are to be 
added, for example, two operand VLTs meet in corresponding 
order in the Function Table and the Data Memory. (In Fig. 5, 
an operand VLT, with data opd2 L and opd2 H (the operand 2 
low word and high word, respectively), had been waiting in the 
Data Memory.) In the FT, at the same time, the tokens fetch a 
PU instruction code, add-multiple (Fig. 5 (b)). 

The token with lower operands goes into the Processing 
Unit (PU) first, two operands are added there, and the carry 
of the addition is kept in its carry register. The tail token with 
the higher word operands then goes into the PU, where another 
addition is performed. But this time, the carry from the lower 
word addition is added at the same time. The resultant data, 
as in Fig. 5 (c), is sent out from the Processing Unit in the same 
form as in the start of the previous operation: the two token 
VLT, with the lower word in the front and the higher word in 
the back. 

Figure 5. Token format transition in double- 
precision data addition with VLT. 
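
The add-multiple operation can be sketched as follows. This is a simplified model under stated assumptions: a 16-bit word width is assumed, the name `add_multiple` is borrowed from the instruction mentioned in the text, and the carry out of the highest word is simply dropped.

```python
WORD_BITS = 16                      # assumed word width
MASK = (1 << WORD_BITS) - 1


def add_multiple(vlt_a, vlt_b):
    """Add two multiple-precision values held as VLTs (low word first)."""
    if len(vlt_a) != len(vlt_b):
        raise ValueError("operand VLTs must have the same length")
    carry = 0                       # carry register, initially clear
    result = []
    for a, b in zip(vlt_a, vlt_b):  # tokens enter the PU consecutively
        total = a + b + carry
        result.append(total & MASK)
        carry = total >> WORD_BITS  # kept for the next token of the VLT
    return result                   # same format as the operands
```

For example, adding the two-word values 0x0001FFFF and 0x00000001 propagates a carry from the low word into the high word.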


With this technique, multiple-precision data can 
be handled in the same manner as ordinary single-precision 
data. The processing cost for synchronization between lower 
and higher tokens with a conventional dataflow approach can 
be eliminated, as may be seen in Fig. 6, a dataflow graph il- 
lustrating a program for double precision data addition. This 
figure, for a processor using the VLT technique, may be com- 
pared to the graph in Fig. 3, for a conventional case. The time 
lag between the start of and end of an operation (latency) may, 
moreover, be reduced by half. 


Figure 6. Dataflow graph for double- 
precision data addition in the proposed 
V-TIP. 


This approach can be applied to multiple (longer than two) 
word data addition. A similar technique, moreover, can cover 
multiplication of a multiple word data item $(A_1, A_2, \ldots, A_n)$ by a 
single word data item $B$. In the latter case, each word of the 'A' data, 
$A_k$, is multiplied by $B$ in the Multiplier Processor, and then, in 
the ALU Processor, the lower part of the product, $P^{(low)}_k$, the 
higher part of the previous product, $P^{(high)}_{k-1}$, and the carry of 
the previous addition, $C_{k-1}$, are added. In order to multiply two 
multiple word values, moreover, the enabling section (the Func- 
tion Table and the Data Memory) provides facilities to decom- 
pose one VLT into single component tokens before the multiplications 
and to add the products with word shifts afterwards. 
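
The multiple-word by single-word multiplication can be sketched as below. The 16-bit word width, the function name `mul_by_word`, and the extra top word appended after the tail token are assumptions for illustration; the text only describes the per-word addition of the low part, the previous high part, and the previous carry.

```python
WORD_BITS = 16                      # assumed word width
MASK = (1 << WORD_BITS) - 1


def mul_by_word(a_words, b):
    """Multiply a multiple-word value (low word first) by one word."""
    prev_high = 0                   # high part of the previous product
    carry = 0                       # carry of the previous addition
    result = []
    for a in a_words:
        product = a * b             # Multiplier Processor
        total = (product & MASK) + prev_high + carry   # ALU Processor
        result.append(total & MASK)
        carry = total >> WORD_BITS
        prev_high = product >> WORD_BITS
    result.append(prev_high + carry)   # assumed: emit the final high word
    return result
```

Reassembling the result words into one integer gives the same value as an ordinary integer multiplication, which is a convenient check of the word-level recurrence.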


ii) Vector Accumulation A set of vector data can be rep- 
resented by a VLT, and by using registers in the Processing 
Unit, the summation of the data in a vector can be obtained. 
One practical example is the calculation of the inner product of 
two vectors, which is often used in matrix multiplications and 
convolutions in spatial filters. 


To calculate the inner product $s_N$ of vectors $a_i$ and $b_i$, 
for i = 1 to N, one must calculate sequentially, with conventional 
machines such as the µPD7281, as follows: 

$$
\begin{aligned}
s_1 &= a_1 b_1, & t_2 &= a_2 b_2, \\
s_2 &= t_2 + s_1, & t_3 &= a_3 b_3, \\
&\;\;\vdots \\
s_{N-1} &= t_{N-1} + s_{N-2}, & t_N &= a_N b_N, \\
s_N &= t_N + s_{N-1}.
\end{aligned}
$$

In this case, the whole operation is carried out using combina- 
tions of dyadic operations and takes 2N - 1 processing clocks, 
and the latency is N - 1 times the number of steps required for 
the token to go around the inner loop of the processor 
(with the µPD7281, it is 7 steps). 

In the proposed V-TIP dataflow processor, a vector is rep- 
resented by a VLT with the same word length. In this case, 
a VLT consisting of N tokens with data $a_1, \ldots, a_N$ comes into 
the processor from the outer TIP bus, while the coefficients 
$b_1, \ldots, b_N$ reside in the Data Memory. A token with data $a_i$ 
in the incoming VLT matches (and fetches) the corresponding 
coefficient $b_i$ in the Data Memory, and the two operands go to 
the Processing Unit. 


In the Multiplier Processor of the PU, data $a_i$ and $b_i$ are 
multiplied. Immediately after the multiplication, the product is 
added to the value in the register of the ALU Processor, whose 
initial value is 0. Accordingly, after the multiplication of the 
last (N-th) vector components, the product $a_N b_N$ is added to the 
value in the register, which is the (N-1)th partial sum $s_{N-1}$, 
and the inner product $s_N$ is obtained. The token with the total 
sum is then sent out and the register in the ALU Processor (ALUP) used in the 
operation is cleared to zero for the next use. Consequently, 
calculation of the inner product of two vectors of length N can 
be accomplished in N processing clocks. Moreover, the latency 
is only N clocks. 
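
The accumulation scheme and the two clock counts can be illustrated with a short sketch. The function names are hypothetical, and the model counts one processing clock per token, which idealizes the pipelined multiplier and accumulator described in the text.

```python
def inner_product_vlt(a_vlt, coeffs):
    """Accumulate a_i * b_i over one VLT; return (sum, processing clocks)."""
    acc = 0                          # ALU Processor register, initially 0
    clocks = 0
    result = None
    last = len(a_vlt) - 1
    for i, a in enumerate(a_vlt):
        product = a * coeffs[i]      # Multiplier Processor
        acc += product               # ALU Processor adds into the register
        clocks += 1                  # one token per processing clock
        if i == last:                # tail token: emit sum, clear register
            result, acc = acc, 0
    return result, clocks


def conventional_clocks(n):
    """N multiplications plus N - 1 dyadic additions."""
    return 2 * n - 1
```

For N = 3 the VLT version needs 3 processing clocks, against 2N - 1 = 5 for the dyadic formulation.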


This improvement has been made possible by the fact that 
the Processing Unit has a multiplier (in the Multiplier Pro- 
cessor) and an adder with a register (accumulator, in the ALU 
Processor) in sequential order, both operating in parallel. Thus, 
the two component operations, multiplication and addition, can 
be executed in a pipeline manner. 


One thing that should be noted here is the use of a regis- 
ter in dataflow machines. One of the advantages of the dataflow 
computation model is the referential transparency ensured by 
the functionality of operations. The use of registers in dataflow 
machines is, therefore, generally considered harmful, as it causes 
side effects and detracts from the above advantage, although the 
introduction of registers does serve to speed up the processing. 
In the proposed V-TIP architecture, however, the status de- 
pendencies are confined in VLTs, as the data on the registers 
is set, referenced and modified only by the tokens in the same 
VLT (which always flow consecutively), and is cleared by its 
tail token. Thus register use has no side effect on tokens other 
than those in the involved VLT. Consequently, users will be 
able to benefit from the high speed computation enabled by the 
registers, without having to give consideration to the exclusive 
register utilization or to program specifically the management 
of inter-token synchronizations, in order to avoid side effects. 


iii) Indexing  In the V-TIP, as in the µPD7281, the sequence 
of tokens flowing on the same arc in a fixed order and carrying 
out the same instructions is called “stream data.” Data in this 
stream will be processed as intended, if and only if the order 
of tokens in the stream is preserved. Data in a stream are, for 
example, retrieved from the memory, undergo several instruc- 
tions and are finally stored in the memory in the correct place, 
because the final order of the data tokens in the stream is the 
same as that of the prepared address tokens. 


However, when some of the tokens in the stream condition- 
ally switch (as a result of a conditional-branch instruction), 
depending on the token data value, and take a different dataflow 
path from that of the others, then the token sequences from the different 
paths cannot satisfy the above condition after the paths merge. 
This is because the number of steps that the tokens on different paths 
require to run from the branch point to the merging point differs 
depending on the path involved. 


In this case, it is possible to make a VLT of a token pair, a 
data token and the address token with which the data is finally 
to be stored in the memory. (Here, the second token carries 
additional control information for the first data token, so it can 
be called an “index token”.) This technique causes the index 
token with the address to “accompany” the data token along 
the same path, and makes it possible to ensure the correctness 
of memory storage in whatever order the data is written into the 
memory, when this write action takes place after the processing 
is over. 


A token with a number, instead of a memory address, to 
indicate the original position of the data token in the stream, 
can also be appended to it as an index token in the form of a 
VLT. This number can be used to re-order the tokens using a 
temporary buffer. 


Figure 7 shows how this is done. An index token with 
“Context-k” (k = 1,2,...,N), which denotes the original posi- 
tion of the “Data-k” token in the sequence, is affixed to the data 
token using VLT. It always goes with the data token but passes 
through the Processing Unit with no-operation being carried 
out. When branching that depends on the data value occurs, the 
Link Table Address field (which serves as a token identifier) 
of the index token is modified accordingly, so that the index token keeps on taking 
the same route as the data token. Finally, when the VLTs from 
different paths are to be merged, they come into the same node 
and there the data tokens are written into the memory at the 
address indicated by the “context” in the index token. After all 
the tokens arrive and are stored into the memory, the original 
data stream, Data-1, Data-2, ..., Data-N, can be obtained by 
reading it sequentially. 

Figure 7. Context addition to data tokens 
using VLT. Tokens from different paths can 
be re-ordered using the indexes appended to 
the data tokens. 
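
The re-ordering idea can be sketched as follows. This is an illustrative model only: the branch condition (negative values take a slower path that negates them), the function name, and the list-based memory are assumptions, and a two-token VLT is represented as a `(data, index)` pair.

```python
def reorder_with_index(stream):
    """Branch a stream on data value, merge, and restore the order."""
    # Affix an index token ("Context-k") to every data token.
    vlts = [(value, k) for k, value in enumerate(stream)]
    # Conditional branch: assume negative values take a longer path.
    fast = [(data, k) for data, k in vlts if data >= 0]
    slow = [(-data, k) for data, k in vlts if data < 0]
    merged = fast + slow             # arrival order differs from stream order
    # At the merge node, write each data token at the address in its index.
    memory = [None] * len(stream)
    for data, k in merged:
        memory[k] = data
    return merged, memory
```

Reading `memory` sequentially returns the results in the original stream order even though the merged arrival order differs.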


This technique facilitates the handling of stream data with 
less consideration for execution timing and provides an advan- 
tage for high level language programming. 

In another example, when subroutines are to be called from 
different places, the information about the calling sites has to 
be preserved to distinguish the invocations and to return to the 
original calling sites. The user, in that case, can append con- 
text information, an identifier for the invocation, to the data 
token as an index token by means of VLT. (It should be noted 
here that the tagged-token approach used in dynamic archi- 
tecture machines, which is common and effective in handling 
reentrancy, requires complicated hardware and is not appropri- 
ate for implementation on simple VLSI oriented machines like 
V-TIP [6].) 

As has been shown, any additional control information as- 
sociated with the data can be affixed to the data token by an 
index token using VLT. While these index tokens pass the Pro- 
cessing Unit with no-operations carried out and the indexes un- 
touched, these no-operations waste PU com- 
putation power. Therefore, during operations that do not 
change the order of the stream data (which is often the case), 
the index token can be removed from the data token and can 
wait in the memory. When necessary, the data and index 
tokens are synchronized and concatenated again. 


4. V-TIP Multiprocessor System 


The authors propose here a system with V-TIP dataflow 
processors for practical implementation of the VLT technique. 

The basic system, shown in Fig. 8, consists of multiple data 
flow processor elements (V-TIPs) and an “Interface Processor” 
(IFP), connected by a ring-shaped outer TIP bus. The Interface 
Processor, as well as the V-TIP, is data-driven and deals with 
the interfaces among the V-TIP, the Memory Unit and the host 
computer. Individual V-TIPs operate concurrently and inde- 
pendently, depending on the programs loaded into them. They 
access data in the Memory Unit by sending read and write re- 
quest tokens to the IFP. 

Figure 8. V-TIP Multiprocessor System. 


At the very beginning, the V-TIP program is stored in the 
Memory Unit in the form of program load tokens. When the 
user gives a program load command through the host computer, 
the IFP begins reading program loading tokens in the Memory 
Unit and sending them to individual processors. Each token reaches 
the V-TIP denoted by its Module Number field, and 
the contents of the tokens are loaded into the local memories in the 
processor, i.e., the Link Table, the Function Table, the Data 
Memory, and the Bus Token Table. 

When the user commands the start of processing by send- 
ing tokens to the V-TIPs, the V-TIPs begin operating by sending data 
read request tokens to the IFP; the IFP then fetches the data 
from the Memory Unit, constructs tokens and sends the data 
sequences (streams) to the requesting V-TIPs. The tokens flow 
into a V-TIP, circulate along the inner ring bus, undergo the 
pertinent programmed operations, and are sent out to the IFP 
as data write tokens when the processing is over. The IFP 
automatically generates sequences of write addresses, and the 
arriving stream data are stored in the Memory Unit in the ap- 
propriate positions. 

Generally the clock rate at which the processor element 
operates and the token transmission rate on the outer TIP bus 
(50 nsec to 100 nsec) are much faster than the memory access 
rate at the Memory Unit, i.e., around 400 nsec to 600 nsec with 
Dynamic RAM. (With Static RAM, of course, the access time 
is comparable to the clock cycle. However, for achieving high 
cost-performance in a compact system with a large amount of 
memory, Dynamic RAM is more realistic.) Thus, a technique 
to enhance the total memory access rate is needed to make the 
most of the processing speed available when using the V-TIPs. 


In the proposed Interface Processor for the V-TIP system, 
simultaneous memory access by interleaved memory is used to 
permit rapid access to the Memory Unit. The Memory Unit for 
the system comprises 16 memory modules and supports parallel 
access to any 16 consecutive points, in either a horizontal or a ver- 
tical sequence, on a 2-dimensional area. When the 
sequential memory read is requested by a V-TIP, for example, 
the Interface Processor generates access addresses, sets them 
in the registers each corresponding to the 16 memory modules, 
and gives the read signal. The retrieved data are copied into 
the data registers. Each register set has two groups of registers; 
there are 16 plus 16 read address registers, for instance. Those 
two groups are used like swinging (double) buffers. 


By connecting multiple V-TIPs on the ring bus, as with 
this system, the system performance is enhanced in proportion 
to the number of V-TIPs, since the processors can work indepen- 
dently and concurrently. In such a case, of course, performance 
degradation due to a bus bottleneck and shared-memory ac- 
cess contention has to be considered. Moreover, when memory 
access is so frequent that it limits system per- 
formance (for example, the interleaved memory cannot provide 
increased speed for random access to the Memory Unit), multi- 
ple Memory Units must be, and can be, connected to 
the outer TIP bus. In this case, each Memory Unit is connected 
to the outer TIP bus through an IFP, and is assigned a different 
module number for accessing. 


As has been explained, the V-TIP system has concur- 
rency at two levels: individual instructions are partitioned into 
pipeline stages that function in parallel, and processors in the 
system function in parallel as tokens flow. 


To meet larger-scale computation requirements, a more 
highly parallel system can be built by connecting a number of 
the above TIP ring units (multiple V-TIP elements and an In- 
terface Processor connected by the outer TIP bus) together. 
The Interface Processor (IFP) has communication ports that 
exchange communication packets in a byte-serial manner with 
IFPs in the neighboring TIP ring units. Thus, this architecture 
permits construction of a massively parallel system, suited to 
the needs of the user. 


5. System Performance 


The V-TIP system performance was evaluated by analyz- 
ing and simulating basic image processing and numerical com- 
putation application programs. 


Evaluation 


Evaluation was begun by making a dataflow graph for an 
application computation, the equivalent of a flow chart and a 
program description for control-flow machines, and then sum- 
ming up the clock cycles needed for the Processing Unit to cover 
the whole process. Underutilization is analyzed by estimating 
the number of idle steps during the execution. The number of 
idle steps can be approximated by calculating the quantity of 
token flow on arcs and analyzing the dependency among the 
instruction nodes in the dataflow graph. This is because if only 
one token flows between two nodes and if there is no other paral- 
lelism, for example, the instructions on those nodes are executed 
sequentially and cannot be pipelined, thus leading to an idling 
of the Processing Unit between two instruction executions. 


Assumed Parameters 


In the performance analysis, the pipeline cycle for the 
dataflow processor V-TIP was assumed to be 40 nsec, token 
transmission rate on the outer TIP bus to be 80 nsec/token, 
and memory access speed to be 400 nsec/access. (Remember 
that the memory unit is 16-way interleaved, so it can both read 
and write data in 50 nsec, if the access is sequential.) This 
means that a V-TIP can attain a maximum performance of 25 
MOPS if the stream data are fed to the processor element at a 
sufficient rate and its Processing Unit functions fully. 


Applications and Speeds 

a) Spatial Filter  In a spatial filter operation with a 3 x 3 
kernel on a 256 x 256 pixel image, three VLTs containing 3 pairs 
of data (for 3 neighboring points) and the coefficients effec- 
tively utilize the pipeline hardware. The nine-point convolution 
operation for one output pixel requires 23 clocks of the Process- 
ing Unit, i.e. 920 nsec/pixel (40nsec x 23). The number of mem- 
ory access requests from the V-TIP to the Interface Processor 
(IFP) can be reduced to only a pair of read and write requests 
per unit convolution by storing three lines of the original im- 
age in the Data Memory. In other words, a unit convolution 
occupies 160 nsec in the bus transmission and 100 nsec in the 
memory access. This means that this computation is processing- 
bound, not memory access bound or outer TIP bus transmission 
bound, and can be speeded up by using more V-TIPs in parallel. 


Since the token transmission rate is 160 nsec/unit (one read- 
request token and one write-data token go from a V-TIP to the 
IFP, and this becomes the bottleneck), maximum overall perfor- 
mance can be attained when 6 processors are used in parallel, 
this being 160 nsec/unit, i.e., 10.5 msec (160 nsec x 256 x 256) 
for a 256 x 256 image. 
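
The figures for the spatial filter follow directly from the assumed parameters, as the short calculation below shows. The variable names are illustrative; all numbers are taken from the text.

```python
import math

PIPELINE_NS = 40            # assumed pipeline cycle of one V-TIP
BUS_NS_PER_TOKEN = 80       # assumed outer TIP bus rate

peak_mops = 1000 / PIPELINE_NS              # one operation per cycle
compute_ns = 23 * PIPELINE_NS               # PU time per output pixel
bus_ns = 2 * BUS_NS_PER_TOKEN               # one read request, one write
# Number of V-TIPs needed before the bus becomes the bottleneck.
processors = math.ceil(compute_ns / bus_ns)
image_ms = bus_ns * 256 * 256 / 1e6         # bus-bound time per image
```

The ratio 920 / 160 = 5.75 explains why six processors are enough to saturate the bus.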


b) FFT A 2-dimensional complex Fast Fourier Transform 
was implemented with a constant-geometry algorithm on a 512 x 
512 image of floating point format data. One butterfly requires 
4 input tokens and 4 output tokens, taking 320 nsec (80nsec x 4) 
for token transmission. One butterfly calculation consumes 18 
PU clock cycles, that is, 40nsec x 18 = 720nsec. Accordingly, a 
two V-TIP system can complete this computation twice as fast 
as one V-TIP system, namely 360 nsec/butterfly. Thus, using 
this two V-TIP system, the whole operation takes 360ns x 128 x 
256(lanes) x 9(stages) x 2(directions) = 0.21 sec. 
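
The FFT estimate can be reproduced in the same way. The variable names are illustrative, and the factors 128, 256, 9 and 2 are used exactly as stated in the text.

```python
PIPELINE_NS = 40
BUS_NS_PER_TOKEN = 80

bus_ns = 4 * BUS_NS_PER_TOKEN        # token transmission per butterfly
compute_ns = 18 * PIPELINE_NS        # PU time per butterfly on one V-TIP
per_butterfly_ns = compute_ns // 2   # two V-TIPs working in parallel
butterflies = 128 * 256 * 9 * 2      # factors as given in the text
total_s = per_butterfly_ns * butterflies / 1e9
```

With two V-TIPs the 360 nsec per butterfly is still above the 320 nsec of bus time, so, on these figures, the computation remains processing-bound.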


c) Character Recognition  To demonstrate the suitability 
of the V-TIP system for pattern processing, character recog- 
nition processing performance was studied. This method for 
recognizing printed alphanumeric characters, based on multiple 
discriminant analysis, includes the following procedures [14]: 

i. Resampling of the input image by a 5 x 5 spatial filter, thus 
composing a 50-dimension feature vector. 

ii. Reduction of the dimension of the feature vector by multi- 
plication with a 50 x 48 matrix. 

iii. Calculation of the distances for each of the dictionary vec- 
tors. 

iv. Detection of the nearest matching vector in the given dic- 
tionary by comparison of the distances. 


The first and second steps each take 0.2 msec. In the third 
and fourth steps, tokens for dictionary vector elements flow into 
the processor one after another. Differences from the corre- 
sponding elements are accumulated and then compared to the 
existing minimum. Each distance calculation takes 120 PU cy- 
cles and this is repeated 91 times, the number of dictionary 
vectors; thus, the last two steps call for 0.44 msec. 




The above evaluation shows, therefore, that a system with 
one operating V-TIP can perform printed character recognition 
at the rate of (0.2+0.2+0.44 =) 0.84 msec/char with a 91- 
character dictionary of 48 dimensions. This performance, 1190 
characters/sec, is about 240 times faster than that of a 16 bit 
micro-computer system implementing the same algorithm [14]. 
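
The character recognition timing can be checked with the same 40 nsec pipeline cycle. The variable names are illustrative; the step times and counts are those given in the text.

```python
PIPELINE_NS = 40

# Steps three and four: 120 PU cycles per dictionary vector, 91 vectors.
distance_ms = 120 * 91 * PIPELINE_NS / 1e6
# Steps one and two take 0.2 msec each; round as the text does.
total_ms = 0.2 + 0.2 + round(distance_ms, 2)
chars_per_sec = 1000 / total_ms
```

The unrounded distance time is 0.4368 msec, which the text reports as 0.44 msec.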


6. Conclusion 


The proposed processor, V-TIP, consists of multiple func- 
tional modules operating in parallel. The inherent concurrency 
in computations is extracted and effective utilization of the Pro- 
cessing Unit is possible by employing dataflow scheme compu- 
tation. The pipelining technique enables high speed processing, 
especially of vector data. 


A new dataflow concept, the Variable Length Token (VLT) 
technique, is introduced to enhance data processing capabil- 
ity and flexibility, in which multiple tokens are controlled so 
that they flow consecutively throughout the system and are 
processed together. This technique, by providing a means to 
increase data granularity, reduces the inter-token synchroniza- 
tion and communication overhead for conventional dataflow ma- 
chines. Thus, it facilitates high speed processing of composite 
data structures in a static dataflow model. Its effectiveness has 
been shown in multi-precision data computation and re-ordering 
of tokens. 


The enabling section of the V-TIP offers measures to con- 
trol VLTs, such as allowing two VLTs to synchronize, concate- 
nating them and decomposing one into single component tokens; 
the Processing Unit operates on VLTs. 


The architecture of a system with multiple V-TIP process- 
ing elements, an Interface Processor (IFP), and a Memory Unit 
has been explained. V-TIPs in the system can operate concur- 
rently and independently, exchanging tokens with each other 
and with the IFP. The IFP supports interleaved parallel access 
to the Memory Unit so that the V-TIPs are supplied tokens at 
a sufficient rate. 


V-TIP system performance has been estimated by analyz- 
ing application program executions. Results indicate that the 
system with the VLT technique produces high performance for 
vector-type data. The proposed architecture appears especially 
suitable for pattern processing. 


Abstract -- In computer graphics, the ray tracing 
algorithm can generate highly realistic images. However, 
it has a major disadvantage: the high computational 
expense associated with generation of an image. To 
increase the image generation speed, our approach is to 
map the computation processes of ray tracing into a 
specialized dynamic data flow architecture for parallel 
processing. To support the computation further, we used 
a spatial-information hierarchy. A model for ray tracing 
computation based on probability is used to analyze the 
load. The architecture is modeled with a closed queueing 
network. Through the analytical models, we have studied 
the relative performance of the architecture under various 
load conditions. 


1. Introduction 


Dataflow architecture is an alternative to Von Neumann 
architecture and is capable of efficiently exploiting a mas- 
sive amount of parallelism inherent in many types of com- 
putation. A dynamic dataflow architecture uses tagged 
tokens to unfold iterative computations so that a high 
degree of parallelism can be achieved [2] [13] [16]. This 
paper proposes a specialized dynamic dataflow architec- 
ture for the ray tracing image generation and investigates 
its performance for the task. 


Ray tracing is a technique for generating images of three 
dimensional objects with a computer. Programs using ray 
tracing can simulate the effects of reflection, refraction 
and shadows to produce computer images that possess a 
strikingly high degree of realism. The technique was ori- 
ginally suggested by Appel [1] and later enhanced by 
Whitted and others to generate images according to the 
laws of optics [18] [12] [4]. 


In this scheme, a ray is fired from the viewer through the 
pixel into the world. The intersection between this ray 
and objects in the world determines the visible surface. 
Shadows are determined by firing rays from the 
intersection point toward the light sources. Two addi- 
tional rays may also be fired from the intersection point 
depending on the surface characteristics, one along the 
reflected direction and the other along the direction of 
transmission. Figure 1 shows the ray tracing terminology 
used in this paper. 


In the rendering process, the color of a pixel is deter- 
mined by an intersection tree. The pixel is the root node 
of the tree. All other nodes in the tree represent ray- 
surface intersection points. Each arc in the tree 


represents a ray used to determine the color of the root 
pixel. A leaf node of the tree corresponds to an intersec- 
tion point between the ray and either a non-reflecting, 
non-transparent surface, or the boundary of the modeled 
world. Figure 2 shows an example of an intersection tree. 
In the case of surfaces aligned in such a way that a branch 
of the tree is very deep (for example, two reflective sur- 
faces in parallel can cause a tree to have infinite depth), 
the branch may be truncated at a predefined depth. This 
is reasonable since the truncated portion contributes very 
little to the color of the pixel. After the tree is com- 
pletely grown, the colors of all nodes in the tree are com- 
puted and are used to find the color of their root pixel. 
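
The effect of depth truncation on the intersection tree can be illustrated with a small sketch. It is a toy model under strong assumptions: every surface hit has the same reflective and transparent properties, geometry is ignored, and the names `Node`, `grow_tree` and `count_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    depth: int
    children: list = field(default_factory=list)


def grow_tree(max_depth, reflective=True, transparent=True, depth=0):
    """Grow an intersection tree, truncating branches at max_depth."""
    node = Node(depth)
    if depth >= max_depth:
        return node                      # branch truncated here
    if reflective:                       # ray along the reflected direction
        node.children.append(
            grow_tree(max_depth, reflective, transparent, depth + 1))
    if transparent:                      # ray along the transmitted direction
        node.children.append(
            grow_tree(max_depth, reflective, transparent, depth + 1))
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)
```

With purely reflective surfaces (the two parallel mirrors case) the tree is a chain that would never end without the depth limit; with both properties the node count doubles at each level.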


Using ray tracing, we may model accurately the distor- 
tions of reflecting and refracting surfaces, thus producing 
highly realistic images. However, there is a major draw- 
back to ray tracing: very high computational expense. 
To determine the color of each pixel requires one to com- 
pute the intersection points between every ray in the tree 
and the surfaces in the scene. One way to reduce the 
amount of computation needed to produce an image is to 
divide the modeled space into subvolumes and to keep a 
note of the surfaces in each subvolume [3] [7] [10]. As a 
ray propagates from one subvolume to the next, the sur- 
faces in each subvolume become candidates for ray inter- 
section. Thus the nearest objects are the first candidates 
for intersection, leading to a quick determination of the 
closest object. One way to partition the space is to divide 
it uniformly. In this scheme, the traversing algorithm 
which traces only the relevant subvolumes is based on a 
three-dimensional scan conversion algorithm. Therefore, 
the time to find a small set of surfaces that potentially 
intersect the ray is O(M), where M is the number of sub- 
divisions on each axis, and it is independent of the 
number of objects in the scene. 
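
A traversal of a uniform grid can be sketched with a common incremental grid walk, shown below. This is not necessarily the scan conversion algorithm the text refers to; it is a stand-in that has the same O(M) behaviour. The function name, the unit-sized voxels, and the requirement that the ray origin lies inside the grid are assumptions.

```python
import math


def traverse_grid(origin, direction, m):
    """Return the voxels of an m x m x m unit grid pierced by a ray, in order."""
    cell = [int(math.floor(c)) for c in origin]
    step, t_max, t_delta = [], [], []
    for o, d, c in zip(origin, direction, cell):
        if d > 0:
            step.append(1)
            t_max.append((c + 1 - o) / d)   # ray parameter at next boundary
            t_delta.append(1 / d)           # parameter increase per voxel
        elif d < 0:
            step.append(-1)
            t_max.append((c - o) / d)
            t_delta.append(-1 / d)
        else:
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    visited = []
    while all(0 <= c < m for c in cell):
        visited.append(tuple(cell))
        axis = t_max.index(min(t_max))      # nearest boundary crossing
        if t_max[axis] == math.inf:
            break                           # zero direction: ray never moves
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return visited
```

Each step advances exactly one coordinate, so a ray visits at most 3M - 2 voxels regardless of how many objects the scene contains.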


Ullner proposed a 2-D array of processor elements (or 
PEs) for ray tracing based on the 3D world space division 
parallelism model [17]. Each PE contains a general pur- 
pose processor, an intersection processor, and a memory 
module. Only those surface models intersecting a subvo- 
lume are kept in the corresponding PE that covers the 
subvolume. Two axes of the 3D space are directly 
mapped onto the processor array. The third dimension of 
the partitioning grid must be simulated within each pro- 
cessor in the array. As a ray travels across the space, the 
ray message travels across the corresponding processors. 
This approach has several disadvantages: (1) high storage 
requirement due to the need to store copies of the same 
object model in multiple PEs; (2) as the number of PEs in 
the system increases, the time for the ray to traverse the 
space increases; (3) difficulty in balancing the load due to 
its rigid mapping between subvolume and processor. 
Dippe and Swensen [6] relaxed the mapping function in 
order to balance the load and proposed a 3-D array of 
processor elements to perform ray tracing. However, the 
approach does not eliminate the other shortcomings and 
the mapping function is more difficult to implement. 


In [5] and [14], a different parallel processing 
scheme for ray tracing was proposed. In this scheme, the 
screen is partitioned into subscreens and each subscreen is 
processed by a PE in a multimicrocomputer environment. 
Object models are stored in each PE if their projection 
intersects the subscreen of the PE (called Y-clipping [5]). 
Additional object models may be fetched by a PE if there 
is a need due to the computation for shadow, reflection, 
or transmission. There are several disadvantages to this 
scheme: (1) it explores parallelism only at image level 
(i.e., screen); (2) as the area of each subscreen 
decreases, the number of copies of each object model in 
the system increases, and the benefit of using Y-clipping 
diminishes. 

In view of the shortcomings of the above architectures,
we list the desirable features for a parallel ray tracing 
architecture as follows: 


• Allowing parallelism among rays and among different
computation processes for each ray.

• Keeping only one copy of each object model in the
system regardless of the number of PEs used. This
reduces the memory requirement of the system and
improves the updatability of the object models in the
scene.

• Allowing the empty-space traversal time for each
ray to be virtually independent of the number of PEs
in the system.


To achieve these goals, we propose a dynamic dataflow
architecture with a hierarchy of spatial information (loca- 
tions of objects in space) and specialized execution 
modules for ray tracing image generation. 


The next section describes the proposed architecture. A 
model for the ray tracing image generation computation 
based on probability is described in section 3. The queu- 
ing network of the proposed architecture is also described 
in section 3. The results are discussed in section 4. Section 5
concludes the paper.


2. The Architecture 


To explore the parallelism in ray tracing image genera- 
tion, a dataflow graph for ray tracing based on space sub- 
division is presented in figure 3a. The data tokens for the 
operators are shown in figure 3b. Basically, the diagrams 
indicate how to grow an intersection tree from a root 
pixel. The color of a pixel is obtained by accumulating all
the color components inside the corresponding intersection
tree.

In this paper, we propose to use a specialized dynamic
dataflow architecture for ray tracing image generation.
The processes in figure 3a map directly onto the modules 
in the dynamic dataflow architecture shown in figure 4. 
This architecture contains six types of modules : Ray 
Generator (RG), Empty Space Processor (ESP), Fetch 
Unit (FU), Intersection Processor (IP), Intersection 
Result Buffer (IRB) and Color Accumulator (CA). The
modules are connected in a circular-pipelined fashion.


Because all intersection trees are mutually independent, 
many intersections may be processed simultaneously. To 
allow a high degree of parallelism, multiple copies of each 
type of module may be used. In each module, execution 
starts once the "operand token" is available. If more than 
one "operand token" enters a processor, the processor 
places them in a queue, and executes them one at a time. 
In the system, the "root pixel" is used as the tag for most 
tokens for the purpose of accumulating color for each 
pixel. Tokens from P4 to P6 in figure 3a have longer tags 
due to the unfolding of iterative computation P5. 


The ray tracing system keeps a hierarchy of subvolume 
and surface information. This "spatial-information" hierarchy
includes an Object Model Storage (OMS), Subvolume
Surface Lists (SSL), and an Empty Subvolume Map
(ESM) (see figures 3 and 4). The ESM stores
empty/non-empty bits for all the subvolumes in the scene. 
This map is used to determine the first non-empty subvo- 
lume encountered as a ray travels in space. It allows rays 
to bypass empty subvolumes quickly. The SSL is a list of 
surfaces intersecting a given subvolume. Each entry in 
the SSL contains a surface ID and a pointer pointing to 
the beginning of the surface in the OMS. The SSL thus
adds a level of indirection for reading surface
models: instead of storing multiple copies of a surface in
several subvolumes, a multiple IP-OMS system
needs only pointers in multiple SSLs, one for each
intersecting subvolume. Therefore, only one copy
of each surface description is needed in a multiple IP-OMS system.
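The three levels of the hierarchy can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `SpatialHierarchy`, the use of a Python set for the ESM and a dictionary for the SSLs, and list indices standing in for OMS pointers are all hypothetical and say nothing about the hardware organization.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialHierarchy:
    oms: list = field(default_factory=list)   # Object Model Storage: one copy per surface
    ssl: dict = field(default_factory=dict)   # subvolume -> list of pointers into the OMS
    esm: set = field(default_factory=set)     # non-empty subvolumes (stands in for the bit map)

    def add_surface(self, model, subvolumes):
        """Store `model` once and reference it from every subvolume it intersects."""
        pointer = len(self.oms)
        self.oms.append(model)
        for sv in subvolumes:
            self.ssl.setdefault(sv, []).append(pointer)
            self.esm.add(sv)
        return pointer

    def is_empty(self, sv):
        """ESM lookup: True when no surface intersects subvolume `sv`."""
        return sv not in self.esm

    def surfaces_in(self, sv):
        """Follow the SSL pointers of subvolume `sv` into the OMS."""
        return [self.oms[p] for p in self.ssl.get(sv, [])]
```

A surface that spans several subvolumes adds one pointer to each of their SSLs, while the OMS grows by a single entry.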


The following describes the function of each module and 
how the architecture works. 


The architecture acts as a display system attached to a 
host computer. After first initializing the system, the host 
computer must program the system with parameters. 
(For example, how to partition the world into subvo- 
lumes, the position and color of every light source, and 
parameters of the viewing pyramid.) Then, the host com- 
puter transfers all the object models and texture maps in 
the scene to the display system. This information is distri- 
butively stored in the OMS to allow parallel accessing. 


The spatial information hierarchy is created by perform- 
ing clipping on all the surfaces in the scene with each sub- 
volume. In this clipping operation, the majority of the 
subvolumes are trivially rejected and only a small set of 
subvolumes becomes candidates for a more detailed clip- 
ping test. Such a test can be performed either by the IP 
or by different dedicated hardware. The results of the 
clipping operations for each subvolume are stored in the 
SSL for the subvolume, and the empty/non-empty conditions
for all subvolumes are stored in the ESM. An
alternative is to skip the detailed clipping test
altogether. In this case, surfaces are
always included in the SSL of their candidate subvolumes. 
The load for the intersection processor becomes heavier 
due to the longer SSL lists. To avoid counting any inter- 
section point multiple times during image generation, the 
IP must compute and output only those intersection points 
that reside inside the intended subvolume. 


As the number of subvolumes increases, the average 
length of the SSL becomes shorter, and the ray-surface 
intersection computation becomes faster. Thus, the pro- 
posed spatial-information hierarchy can reduce both the 
object storage requirement and the ray tracing time. 


When the initial transferring of the system parameters and 
object models is complete, the RG begins to generate pri- 
mary rays and sends these rays to the ESP to find the first 
non-empty subvolume on the path of the ray. Once the 
ESP finds the first non-empty subvolume, the ray with the 
subvolume number is transferred to the FU. The FU 
identifies all surfaces contained in the subvolume and 
sends a sequence of "ray-surface ID" pairs to the IP. The 
IP retrieves the surfaces from the OMS and computes the 
intersection point between the pair. The IRB then com- 
pares the intersection points from the output of IP and 
keeps only the intersection point closest to the origin of
the ray in its register. 
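The comparison performed by the IRB amounts to a running minimum per ray. A minimal sketch, assuming a hypothetical token format of (tag, distance, surface ID) for the IP output; the function name `closest_hits` is also illustrative:

```python
import math


def closest_hits(results):
    """Keep, for each ray tag, the intersection nearest the ray origin.

    `results` is an iterable of (tag, distance, surface_id) tuples, a
    hypothetical form of the tokens leaving the IP.
    """
    best = {}
    for tag, distance, surface_id in results:
        # Replace the stored hit only when the new one is strictly closer.
        if distance < best.get(tag, (math.inf, None))[0]:
            best[tag] = (distance, surface_id)
    return best
```

Because only one entry per tag is held, the storage needed is independent of the length of the Subvolume Surface List.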


If there is no intersection in a subvolume, the ESP is so 
informed. The ESP finds the next non-empty subvolume 
on the path of the ray. Once the intersection point closest 
to the origin of the ray is found, shadow rays are fired 
from this point toward all the light sources in the scene. 
These shadow rays are sent to the ESP. If the intersected 
surface is reflective or transparent, a reflection ray or a 
transmission ray is fired and sent to the ESP. The color 
of each intersection point is determined by either the IP 
or the IRB, depending upon the type of ray. The color of 
each pixel is determined in the CA by accumulating the 
color of all nodes in an intersection tree. 


3. Analytical Models 
3.1 Load Model 


The amount of system load for ray tracing an image is 
modeled stochastically. First, we find the amount of computation
needed to trace a ray. Then, we determine the
total number of rays needed to be traced for an image. A 
numerical example is given to illustrate our model. All 
mathematical symbols used in the model are listed in 
table 1. 


Assume that the 3-dimensional modeled scene is uniformly
divided into $M^3$ subvolumes and that there are N
surfaces in the scene.


Let S = average number of subvolumes that a surface
occupies. Then, S is a function of the ratio between the
average object surface size and the subvolume size. Let’s
assume that S is proportional to $M^2$, and call the proportionality
constant $\beta$, the average surface size coefficient.
Then, $S = \beta M^2$. $\beta$ is related to the ratio between the
average surface area and the area covered by one side of
the scene.


Let p be the probability of a surface in a subvolume.
Since a surface on the average occupies S subvolumes,
and there are a total of $M^3$ subvolumes,

$$p = \frac{S}{M^3} = \frac{\beta M^2}{M^3} = \frac{\beta}{M}$$

Let f(x) be the probability of having x surfaces in a subvolume.
Assuming the surfaces in the scene are similar in
size and are placed randomly, the probability of having
a particular set of x surfaces in a subvolume is $p^x (1-p)^{N-x}$.
However, this is merely one way of having x surfaces in a subvolume.
There are a total of $\binom{N}{x}$ different ways of
selecting x surfaces in a subvolume. Therefore, f(x) follows
a binomial distribution and can be expressed as:

$$f(x) = \binom{N}{x} p^x (1-p)^{N-x}$$

Let n be the average number of surfaces in a subvolume.
From the binomial distribution,

$$n = N p = \frac{N S}{M^3} = \frac{N \beta}{M}$$

The probability that a subvolume is empty is f(0).

$$f(0) = \binom{N}{0} p^0 (1-p)^{N-0} = (1-p)^N$$

This is true because f(0) is the probability that none of the N
surfaces is in the subvolume.


Let q be the probability of a ray intersecting any surface.

$$q = 1 - [f(0)]^M$$

In our numerical example we will assume $\beta = 0.001$ and
M = 200; then $p = 5.0 \times 10^{-6}$. Figure 6 shows q as a
function of $\beta$ and N.


If we assume that there are 1,000 surfaces in the scene,
then n = 0.005, f(0) = 0.995, and q = 0.632.
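The numerical example can be checked with a few lines of Python. This is a sketch of the load model as defined above; the function names `occupancy` and `surfaces_pmf` are illustrative and not part of the original work.

```python
from math import comb


def surfaces_pmf(x, n_surfaces, p):
    """f(x): binomial probability of x surfaces in one subvolume."""
    return comb(n_surfaces, x) * p ** x * (1 - p) ** (n_surfaces - x)


def occupancy(beta, m, n_surfaces):
    """Return (p, n, f(0), q) for surface size coefficient beta,
    m subdivisions per axis, and n_surfaces surfaces in the scene."""
    p = beta / m                      # probability of a surface in a subvolume
    n = n_surfaces * p                # mean number of surfaces per subvolume
    f0 = (1 - p) ** n_surfaces        # probability that a subvolume is empty
    q = 1 - f0 ** m                   # probability that a ray hits some surface
    return p, n, f0, q


p, n, f0, q = occupancy(beta=0.001, m=200, n_surfaces=1000)
print(f"p={p:.1e} n={n:.3f} f(0)={f0:.3f} q={q:.3f}")
# prints p=5.0e-06 n=0.005 f(0)=0.995 q=0.632
```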


Let g(x) be the probability that a ray travels x-1 empty
subvolumes before reaching a non-empty subvolume.

$$g(1) = 1 - f(0)$$
$$g(2) = f(0)\,[1 - f(0)]$$
$$g(x) = [f(0)]^{x-1}\,[1 - f(0)]$$


Assume that we can always find a ray-surface intersection 
once a non-empty subvolume is reached. In other words, 
the IP never rejects a subvolume which is given by the 
ESP because it could not find a surface to intersect. 


Let D = average number of empty subvolumes traveled
by a ray before reaching a non-empty subvolume. Then:

$$D = M \left(1 - \sum_{x=1}^{M} g(x)\right) + \sum_{x=1}^{M} x\,g(x)
    = M\,[f(0)]^M + \sum_{x=1}^{M} x\,[f(0)]^{x-1}\,[1 - f(0)]$$

Using the parameters already assumed in our numerical
example, D = 127.
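As a check on this value, the expression for D can be evaluated directly. The sketch below assumes the example parameters (surface size coefficient 0.001, M = 200, N = 1,000); the function name is illustrative.

```python
def mean_subvolumes_traveled(f0, m):
    """D: a ray that meets no surface crosses all m subvolumes (first term);
    otherwise it stops at the x-th subvolume with probability g(x)."""
    missed = m * f0 ** m
    stopped = sum(x * f0 ** (x - 1) * (1 - f0) for x in range(1, m + 1))
    return missed + stopped


f0 = (1 - 0.001 / 200) ** 1000          # f(0) for the numerical example
print(round(mean_subvolumes_traveled(f0, 200)))   # prints 127
```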


Let $t_g$ = average ray generation time (for any type of ray)
$t_e$ = average computation time to bypass an empty
subvolume
$t_s$ = average time to fetch a Subvolume Surface List
$t_o$ = average time to fetch a surface from OMS
$t_p$ = average time to compare intersection points
$t_i$ = average computation time for each ray-surface
intersection
$t_c$ = average time to determine the color value of
the intersection point
$t_a$ = average color summation time
Then,
$T_R$ = average computation time per ray in a serial
computer
= average ray generation time (for any type of ray)
+ average time to travel across the empty space
+ average time to find intersection point in a
subvolume
+ average time to accumulate color for the pixel

$$T_R = t_g + D\,t_e + t_s + n\,(t_o + t_i + t_p) + t_c + t_a$$


Let’s call u and w the average percentages of reflective
and transparent surfaces in all subvolumes respectively.
Therefore, 1-u-w is the percentage of opaque surfaces.
Let l be the number of light sources in the scene and R be
the ray tracing resolution of the image. In our example,
assuming the screen resolution is 512×512 pixels and each
pixel has 4 supersamples for anti-aliasing, then R = 1024.
Next, we will determine the average number of nodes in 
an intersection tree and the average number of rays traced 
for an intersection tree. 

The root node of each tree represents a pixel point and, 
by definition, it has exactly one primary ray to be traced 
for it. Its probability of intersecting a surface is q. 
Therefore, on the average, it has q child nodes. Among 
them, q·u nodes are created by intersecting a reflective
surface and q·w nodes are created by intersecting a transparent
surface. As a result, a total of q·(u+w) rays is
traced for the next level of the tree. In addition, to determine
whether the intersection points are directly
illuminated by any light source, q·l shadow rays need to
be traced. The same reasoning is used to determine the
total number of rays traced and the total number of child 
nodes for the entire intersection tree. They are listed as 
follows: 


Total number of nodes per tree

$$= 1 + \sum_{k=0}^{\infty} q^{k+1} (u+w)^k = 1 + \frac{q}{1 - q(u+w)}$$

Total number of vision rays traced per tree

$$= \sum_{k=0}^{\infty} q^k (u+w)^k = \frac{1}{1 - q(u+w)}$$

Total number of shadow rays traced per tree
= (total number of nodes per tree - 1)·l

$$= \frac{q\,l}{1 - q(u+w)}$$

Total number of rays traced per tree
= total number of vision rays
+ total number of shadow rays in the tree

$$= \frac{1 + q\,l}{1 - q(u+w)}$$

Let B be the total number of rays for the final image.

B = (the total number of intersection trees)
· (the total number of rays traced per tree)

$$B = R^2 \cdot \frac{1 + q\,l}{1 - q(u+w)}$$


where the ratio between the total number of vision rays 
and shadow rays is equal to 1 : q·l. Figure 7 shows B as a
function of the number of light sources in the scene (l)
and q. 
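The closed forms for the ray counts follow from a geometric series in q(u+w). The sketch below compares a truncated level-by-level sum with the closed form; the function names, the truncation depth, and the sample values q = 0.632, u = w = 0.15, and four light sources are illustrative assumptions.

```python
def rays_per_tree(q, u, w, lights, depth=200):
    """Sum the expected counts level by level (series truncated at `depth`)."""
    r = q * (u + w)                              # expected secondary rays per vision ray
    vision = sum(r ** k for k in range(depth))   # primary ray plus all secondary rays
    nodes = 1 + q * vision                       # root plus one node per intersecting vision ray
    shadow = (nodes - 1) * lights                # one shadow ray per light at each intersection
    return nodes, vision, shadow


def rays_per_tree_closed(q, u, w, lights):
    """Closed form: (1 + q*l) / (1 - q*(u + w))."""
    return (1 + q * lights) / (1 - q * (u + w))


nodes, vision, shadow = rays_per_tree(0.632, 0.15, 0.15, 4)
print(vision + shadow, rays_per_tree_closed(0.632, 0.15, 0.15, 4))
```

The two printed values agree to within floating-point error, and their ratio of vision rays to shadow rays is 1 : q·l.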


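Therefore, the estimated total ray tracing time for generating
an image based on space subdivision on a serial
processor is $\Phi$, where

$$\Phi = T_R \cdot B
      = \left[ t_g + t_e \left( M\,[f(0)]^M + \sum_{x=1}^{M} x\,[f(0)]^{x-1}\,[1 - f(0)] \right)
        + t_s + \frac{N \beta}{M}\,(t_o + t_i + t_p) + t_c + t_a \right]
        \cdot R^2 \cdot \frac{1 + q\,l}{1 - q(u+w)}$$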


The time used for clipping objects against each subvolume 
can be substantial on a serial processor. However, other 
algorithms to reduce the need for intersection tests 
between every ray and every surface in the scene have 
high overhead too [11]. The modeling of the clipping 
overhead for a serial processor is beyond the scope of this 
paper.

In our example, we will further assume that $t_i$ = 50
microseconds and all other time measurements in the $T_R$
equation are 2 microseconds. Also assume the number
of light sources l = 4 and u = w = 0.15. Then, the total
number of rays $B = 4.57 \times 10^6$, the average ray tracing
time per ray $T_R = 2.62 \times 10^{-4}$ seconds. In a serial
processor, the estimated total ray tracing time for an image is
1198 seconds. The performance of our parallel architec- 
ture is estimated by developing and analyzing its queueing 
model. 
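The serial estimate can be reproduced from the load model. The sketch below assumes the example parameters (surface size coefficient 0.001, M = 200, N = 1,000, R = 1024, four light sources, u = w = 0.15, 50 microseconds per ray-surface intersection and 2 microseconds for every other step); the variable names are illustrative.

```python
US = 1e-6                                   # one microsecond, in seconds

beta, m, n_surfaces = 0.001, 200, 1000
lights, u, w, resolution = 4, 0.15, 0.15, 1024
t_i = 50 * US                               # ray-surface intersection time
t_x = 2 * US                                # every other time in the T_R equation

p = beta / m
n = n_surfaces * p                          # mean surfaces per subvolume
f0 = (1 - p) ** n_surfaces                  # f(0)
q = 1 - f0 ** m
d = m * f0 ** m + sum(x * f0 ** (x - 1) * (1 - f0) for x in range(1, m + 1))

# T_R = t_g + D*t_e + t_s + n*(t_o + t_i + t_p) + t_c + t_a
t_r = t_x + d * t_x + t_x + n * (t_x + t_i + t_x) + t_x + t_x
b = resolution ** 2 * (1 + q * lights) / (1 - q * (u + w))
phi = t_r * b

print(f"B={b:.2e} T_R={t_r:.2e} s")         # prints B=4.57e+06 T_R=2.62e-04 s
print(f"total time is about {phi:.0f} s")
```

The computed total comes out within a few seconds of the 1198 seconds quoted above; the small difference is rounding in the intermediate values.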


3.2 Performance Model 


Our goal of modeling is to obtain relative performance 
measures of the proposed architecture under various load- 
ing conditions. Although queueing models for dynamic 
dataflow architectures have been studied recently [8] [9], 
they are not suitable for our proposed architecture 
because of the specialized modules used in the architec- 
ture. Figure 5 shows our architecture as a network of 
queues. Service centers are the RG, ESP, FU, OMS, IP, 
and IRB. Due to different routings, we need to distin- 
guish two types of jobs in the queueing network. They 
are jobs associated with vision rays (i.e., primary ray, 
reflective ray, and transmission ray) and jobs associated 
with shadow rays. The following describes the sources 
and sinks in the model. 


Sink 1 : When a vision ray travels in space, there is a 
chance that it does not intersect any surface before exiting 
the modeled space. If this happens, no further processing 
is needed for the ray, and the background color should be 
returned to the Color Accumulator. According to our 
load model, the probability of this is 1-q. 


Sink 2 : When a shadow ray travels in space, there is a 
chance that it does not intersect any surface before reach- 
ing a light source. If this happens, the color of the light 
source should be used to compute the color of the ray- 
surface intersection point and return the color of the inter- 
section point to the Color Accumulator. According to 
our load model, the probability of this is also 1-q. 


Source 1 : The intersection computation between a vision 
ray and the multiple surfaces inside a subvolume is 
modeled by the feedback route. The feedback probability 
depends upon the average length of a Subvolume Surface 
List. For a vision ray, in order to determine the closest 
intersection point to the origin of the ray, the IP must 
perform intersection computation between the ray and 
every surface in the SSL. Due to the nature of the com- 
putation, only the vision ray enters the IRB. 


Source 2 : Intersection computations between a shadow 
ray and the multiple surfaces inside a subvolume are 
modeled by the feedback route. The feedback probability 
also depends upon the average length of a Subvolume Sur- 
face List. However, for a shadow ray, only one ray- 
surface intersection is required to determine that a light 
source does not directly illuminate a point. Therefore, on 
the average, half of the SSL is tested for intersection with 
a shadow ray before finding a blocking surface. 


Because of the large degree of parallelism available in ray 
tracing image generation, we assume that the system is 
overloaded most of the time. A throttling mechanism is 
assumed to be used among the modules to limit the 
number of jobs in the system below a threshold. This 
threshold is used as the job size in the closed queueing 
network. Other assumptions made for the performance 
model are listed as follows: 


Assumption 1 : In the model, the IP and the OMS are 
inside a feedback loop. They are assumed to be the 
system’s bottleneck. Multiple OMSs and IPs are used to 
increase the system throughput. A close-coupled IP-OMS 
arrangement is used to reduce the interconnection over- 
head between them. 


Assumption 2 : A ray will always find an intersection 
point within a non-empty subvolume. This is not true in 
general. However, it is true if the number of subvolumes 
approaches infinity. 


Assumption 3 : The service rate for each service center is 
an exponentially distributed random variable. In addi- 
tion, the queueing discipline at each service center follows 
a first come first served (FCFS) policy. 


4. Results and Discussions 


The main intention of this paper is to show a new parallel 
architecture for ray tracing and to determine the relative 
performance of the proposed architecture under different 
load conditions based on the analytical model. For this 
purpose, we have chosen some of the system parameters 
based on the current hardware technology. The service 
rates of the service centers in the queueing model are as
follows. Each IP has an average service rate of 0.1 million
ray-surface intersections per second. This is derived
from the data provided in [17] assuming all surfaces are 
convex polygons and specialized parallel hardware is used 
in the IP. Also, based on the above, we assume that the 
OMS can fetch 0.25 million surfaces per second. The 
IRB has an execution rate of 5 million comparisons per 
second. We also assume the ESP can retrieve empty bits 
at a rate of 10 million retrievals per second, the average 
ray generation rate is 1 million rays per second, and the 
SSL retrieval rate is 1 million SSLs per second. To esti- 
mate the performance of the system, the load model of 
the image generation task and the queueing model of the 
system are combined and analyzed with PANACEA
[15], a software package for analyzing
multiple-job-class Markovian queueing networks.
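To give a feel for how a closed queueing network of this kind behaves, the sketch below applies exact single-class Mean Value Analysis (MVA). This is a substitute technique chosen for illustration; it is not PANACEA and it ignores the distinction between vision-ray and shadow-ray jobs. The service times are the reciprocals of the rates given above, while the visit ratios and the job count of 20 are hypothetical.

```python
def mva(demands, jobs):
    """Exact single-class MVA for a closed network of FCFS exponential servers.

    `demands` maps a service center to its demand per job
    (visit ratio times mean service time). Returns (throughput, queue lengths).
    """
    queue = {c: 0.0 for c in demands}
    throughput = 0.0
    for n in range(1, jobs + 1):
        # Residence time: own service plus the queue seen on arrival.
        resid = {c: demands[c] * (1 + queue[c]) for c in demands}
        throughput = n / sum(resid.values())
        queue = {c: throughput * resid[c] for c in demands}
    return throughput, queue


# Mean service times in microseconds, from the rates quoted in the text.
service_us = {"RG": 1.0, "ESP": 0.1, "FU": 1.0, "OMS": 4.0, "IP": 10.0, "IRB": 0.2}
# Hypothetical visits per ray (assumed for illustration only).
visits = {"RG": 1, "ESP": 127, "FU": 1, "OMS": 5, "IP": 5, "IRB": 5}
demands = {c: service_us[c] * visits[c] for c in service_us}

throughput, queue = mva(demands, jobs=20)
print(max(queue, key=queue.get))            # center with the longest queue
```

With these assumed visit ratios the IP has the largest service demand, so most jobs queue there, which matches the role of the IP-OMS as the bottleneck in Assumption 1.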


Figures 6 and 7 show how image complexity (parameter- 
ized by q) relates to the composition of the image. Fig- 
ure 8 shows the average time to travel across empty space 
in the scene and the average time to find an intersection 
point inside a subvolume as a function of the number of 
subvolumes on each axis (M). In this figure, we have 
normalized the time scale to the empty-bit retrieval time 
and assumed that the ray-surface intersection time is 
equal to 100. Three images with a different number of 
surfaces are plotted in figure 8, which shows that for a 
simple scene, a low value of M is a better choice because 
the Empty Space Processor is the system’s bottleneck. 
However, for a complex scene, a large number of subvo- 
lumes substantially improves the performance of the sys- 
tem. This is due to the fact that for a complex scene, the 
Intersection Processor becomes the system’s bottleneck. 
In this case, the higher M reduces the average length of 
Subvolume Surface List, consequently reducing the time 
to find an intersection point inside a subvolume. An 
optimal M which gives a minimum computation time 
exists for each image. 
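The trade-off behind the optimal M can be illustrated with the load model of section 3.1. The sketch below assumes that the per-ray cost is the sum of the empty-space time D and the intersection time n·100, in units of the empty-bit retrieval time, and that the surface size coefficient stays at 0.001 for every scene; both simplifications, and the function names, are assumptions rather than the computation behind figure 8.

```python
def time_components(m, n_surfaces, beta=0.001, t_empty=1.0, t_intersect=100.0):
    """Return (empty-space time, intersection time) per ray, normalized
    to the empty-bit retrieval time."""
    p = beta / m
    f0 = (1 - p) ** n_surfaces
    d = m * f0 ** m + sum(x * f0 ** (x - 1) * (1 - f0) for x in range(1, m + 1))
    return d * t_empty, n_surfaces * p * t_intersect


def best_m(n_surfaces, m_values=range(1, 401)):
    """Subdivision count that minimizes the summed per-ray time."""
    return min(m_values, key=lambda m: sum(time_components(m, n_surfaces)))


for n_surfaces in (100, 1000, 10000):
    print(n_surfaces, best_m(n_surfaces))
```

Under these assumptions the minimizing M grows with the number of surfaces: the empty-space term grows roughly linearly in M while the intersection term falls as 1/M.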


Table 2 lists the queue length in each server and the 
system’s processing time for a vision ray and a shadow ray 
as functions of q. The results indicate that there are two 
ways that image complexity affects the performance of 
the system. A higher q increases the number of rays that
need to be traced. Also, it increases the percentage of 
rays entering a non-empty subvolume, consequently
increasing the number of intersection computations
required. This causes a higher queue length at the IP- 
OMS, as shown in table 2. 


Table 3 shows the queue length of the service centers 
against the variation of the number of IPs. The shifting of 
the bottleneck from the IP-OMS to the ESP and the FU 
in table 3 is similar to the bottleneck shifting from the 
execution unit to the match unit in a dynamic dataflow 
computer. To relax the potential of being the bottleneck 
of the system, additional ESPs or FUs may be added in 
parallel. In this case, an interconnection structure must
be used between ESPs and FUs and between FUs and
IP-OMSs.


Figure 9 shows the intersection test processing power 
(total number of busy IPs) against the number of IPs in 
the system for two different degrees of image complexity. 
It shows that the processing power approaches a constant 
when the number of IPs is large. However, more inter- 
section test processing power is usable if the work load to 
the IPs increases. Figure 10 shows the time to generate 
an image frame against the number of IPs. The result 
shows that there is an optimal number of IPs which 
require minimum time to generate an image frame. This 
optimal number of IPs increases as the image complexity 
increases. 


In summary, image complexity affects the system load in
several ways and can cause the system’s bottleneck to
shift. When the image complexity is high, our architecture
allows several ways to improve its performance:

1. It uses the spatial-information hierarchy for faster 
empty space traversal.


2. It allows one to increase the subvolume resolution 
(M), consequently reducing the total number of 
intersection computations needed to produce an 
image. 


3. It uses multiple IPs and OMSs in a close-coupled 
arrangement for a higher intersection-test rate. If 
the bottleneck shifts to the ESP and FU, one may 
also add more ESPs and FUs in parallel to relax the 
bottleneck. 


5. CONCLUSION 


In this paper, we have proposed a specialized dynamic 
dataflow architecture for ray tracing image generation. 
Our architecture reduces the object storage requirement 
and increases the image generation rate especially when 
the complexity of the image is high. We have developed 
a load model for ray tracing computation based on proba- 
bility. This parallel architecture is modeled with a closed 
queuing network. The results of the load model provide 
some parameters for the queueing model. Through these 
analytical models we have learned the relative performance
of the architecture under various load conditions.


Figure 2. INTERSECTION TREE FOR A PIXEL


In the ray tracing scheme, the color of a pixel is determined by an intersection tree. The root node of the
tree corresponds to a pixel in the image. All other nodes in the intersection tree correspond to
intersection points, and arcs correspond to rays. In this figure, I, P, R, T, S and N indicate pixel,
primary ray, reflection rays, transmission rays, shadow rays, and surface normals respectively.


R = the ray tracing resolution of the image.
N = the number of surfaces in the scene.
M = the number of subdivisions on each axis.
l = the number of light sources in the scene.
u = the percentage of reflective surfaces in all subvolumes.
w = the percentage of transparent surfaces in all subvolumes.
S = the average number of subvolumes that a surface occupies.
$\beta$ = the average surface size coefficient,
i.e. the proportionality constant between S and $M^2$.
p = the probability of a surface in a subvolume.
f(x) = the probability of having x surfaces in a subvolume.
n = the average number of surfaces in a subvolume.
q = the probability of a ray intersecting any surface.
g(x) = the probability of a ray traveling x-1 empty subvolumes
before reaching a non-empty subvolume.
D = the average number of empty subvolumes traveled by a ray
before reaching a non-empty subvolume.
$t_g$ = the average ray generation time (for any type of ray).
$t_e$ = the average computation time to bypass an empty subvolume.
$t_s$ = the average time to fetch a Subvolume Surface List.
$t_o$ = the average time to fetch a surface from OMS.
$t_p$ = the average time to compare intersection points.
$t_i$ = the average computation time for each ray-surface intersection.
$t_c$ = the average time to determine the color value of the intersection point.
$t_a$ = the average color summation time.
$T_R$ = the average computation time per ray.
B = the total number of rays traced for generating an image.
$\Phi$ = the total ray tracing time for generating an image on a serial processor.


Table 1. MATHEMATICAL SYMBOLS USED IN THIS PAPER


Table 2. QUEUE LENGTH AND RAY PROCESSING TIME (in microseconds)
vs THE PROBABILITY OF A RAY INTERSECTING ANY SURFACE (q)
(column groups: PROCESS TIME, QUEUE LENGTH; V.RAY = VISION RAY)

Table 3. QUEUE LENGTH vs THE NUMBER OF IP-OMS IN THE SYSTEM

Figure 1. RAY TRACING TERMINOLOGY USED IN THE PAPER


Primary ray : the first ray traced for a pixel. Primary rays are created by projecting pixels in the
viewing direction.

Secondary ray : all reflected rays and transmitted rays.

Vision ray : all primary rays and secondary rays.

Shadow ray : the ray fired from an intersection point toward a light source.

A vision ray may be terminated in the following three ways :


1. intersecting a non-reflective and non-transparent surface (shown in this figure).
2. exiting the world being modeled.
3. reaching the predefined tree depth.


A shadow ray terminates after reaching a surface or a light source.

Figure 3a. A DATAFLOW GRAPH FOR THE RAY TRACING COMPUTATION

Figure 3b. OPERATORS USED IN THE RAY TRACING COMPUTATION

Figure 4. PARALLEL ARCHITECTURE FOR RAY TRACING IMAGE GENERATION

Figure 5. PERFORMANCE MODEL OF THE ARCHITECTURE

Figure 10. IMAGE GENERATION TIME vs NUMBER OF IP-OMS


Abstract


The static data flow model of computation offers high perfor- 
mance and scalability by exploiting fine-grained parallelism 
and flow control constrained only by data precedence. Unfor- 
tunately, the token driven mechanism upon which most 
proposed static data flow architectures are based is inefficient 
for communication and synchronization, being profligate in its 
use of memory bandwidth and micro-operations. Associative 
templates have been proposed as a temporally efficient alterna- 
tive to tokens. This approach applies associative processing 
methods to data flow communication and synchronization. This 
paper presents a practical associative template based architecture
that provides effective, fine-grained static data flow
computation. 


1. Introduction 


Efficient techniques for managing fine-grained parallelism are pre- 
requisite to very high performance computation. In the static data 
flow model [1, 2] of computation, program execution is typically 
coordinated by tokens [3], directed packets providing communication 
and synchronization among operation templates. Token-based data 
flow architectures [4, 5] have memory bandwidth requirements (the 
number of memory accesses per operation) and serial temporal over- 
head (the number of primitive micro-operations that must be 
performed sequentially per instruction execution) greatly in excess of
equivalent characteristics for conventional uniprocessors performing 
the same tasks. The present dearth of data flow machines in the high 
performance computing arena is due, in part, to the intrinsic inefficien- 
cy of the token driven approach. Viable static data flow awaits an 
alternate execution mechanism. 


The associative template [6] mechanism evolved from the need to 
significantly reduce overhead and make better use of communication 
bandwidth between integrated circuits. It employs associative process- 
ing methods to perform program flow control in a data flow 
context. The associative template mechanism fully satisfies the seman- 
tic requirements of the static data flow model, and has previously 
been described [7] in the context of a single-node system. 


Associative diffusion, a related associative processing technique 
that extends the associative template approach to multi-node systems, 
is a new method for supporting communication between adjacent nodes 
in a mesh interconnected topology. For this specific class of system 
structures, associative diffusion provides nearest neighbor communica- 
tion without tokens, and without incurring an additional time 
penalty. It does so by overlapping the domains of associativity across 
adjacent node boundaries. There is an overhead cost in time for communication
between nonadjacent nodes, so this method is best suited
for those algorithms that can be statically mapped onto the node array
to require mostly nearest neighbor† transactions.


Together, the associative template and associative diffusion con- 
cepts establish an alternate approach to static data flow architecture. 
By eliminating tokens, static data flow architectures of much greater 
efficiency can be realized. However, the unsophisticated implementa- 
tion of these associative methods can result in specifications that 
exceed the capability of today’s technology in terms of packaging, 
power dissipation, and electrical characteristics. 


This paper presents a new static data flow architecture based on 
associative templates and associative diffusion. The Associative Tem- 
plate Dataflow (ATD) computer architecture separates the 
synchronization control and data communication functions of the asso- 
ciative template and associative diffusion mechanisms, providing the 
low-overhead communication between neighbors at little additional 
cost, and resulting in a system organization whose components can be 
realized with current technology. 


For certain classes of storage allocation and program structure the 
architecture permits maximum throughput of critical components. 
This is achieved by decoupling the synchronization of modules into an 
ensemble of asynchronously interacting client and server components, 
maximizing throughput utilization across the interfaces to critical 
(expensive, performance constraining) system elements, and minimiz- 
ing the number of such transactions that must be performed per 
operation execution (template firing). 


In the following sections, the associative template and associative 
diffusion concepts are reviewed in their generalized form, and a sim- 
plistic single node architecture is described to demonstrate that the 
semantic criteria of the static data flow computing model are met. 
The ATD architecture embodies associative templates and associative 
diffusion in a practical implementation. The architecture is functional- 
ly decomposed, the modules are described in detail, and some of the 
variations and trade-offs are considered. The concluding discussion 
focuses on unresolved problems of hardware implementation and exe- 
cution of real world applications. 


2. Background 


2.1 The Static Data Flow Model 


The static data flow model of computation is a set of semantic 
policies that must be supported by the underlying execution medium. 
Foremost among these are: 


1) A data flow operator will execute when its operands
have been computed by its argument source operators
(data driven synchronization).


2) Only a single instantiation of each operator at a
time is permitted. An operator will execute only when
the operators that use its result values (the recipients)
have completed their most recent execution and are
prepared to accept new operand values.


† In the class of interconnection schemes discussed here, nearest neighbor transactions are
always of distance one.


Each operator in a data flow program is represented by a template, a
small data structure that explicitly or implicitly specifies the
operation to be performed, the source templates that supply the argu-
ment values, the recipient templates that use the result values as
operands, and the transient internal state of the operator as it
progresses through its execution cycle. Conditional flow control is
supported by special templates. The select template chooses as its
result value one of two argument values depending on the Boolean
value of its third argument. The switch template makes its primary
input datum available to one of two sets of prescribed recipient
templates depending on the value of its other Boolean operand (see
Figure 1).


Figure 1. The Select and Switch operators. The Select operator passes
either its A or B operand based on the value of the third [...] arrival
of the operand not selected. The Switch operator passes its A operand to
one of two groups of recipients, based on the value of Z.
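

The behavior of the two conditional templates can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not of any hardware: the function names, the use of `None` to mean "nothing delivered", and the convention that a true control value picks A (or the first recipient group) are assumptions made for the example.

```python
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")


def select(a: T, b: T, control: bool) -> T:
    """Select template: the result is one of the two argument values.

    Assumption: a true control value picks A, a false one picks B.
    """
    return a if control else b


def switch(a: T, control: bool) -> Tuple[Optional[T], Optional[T]]:
    """Switch template: route the primary input to one of two recipient groups.

    Returns (value for group 0, value for group 1). The group that is not
    chosen gets None, meaning that no operand is delivered to it.
    Assumption: a true control value chooses group 0.
    """
    return (a, None) if control else (None, a)
```

For example, `select(1, 2, False)` yields the B operand, and `switch(5, True)` delivers the value only to the first group of recipients.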


In most proposed static data flow architectures the assumed imple- 
mentation mechanism is the token, a small message packet for 
synchronization and communication among templates. In such architec- 
tures each operator execution involves a number of token handling 
micro-operations and a number of memory accesses to the template 
store, resulting in temporal overhead and memory bandwidth require- 
ments almost an order of magnitude greater than that of executing the 
same operation on a conventional RISC microprocessor [8]. 


2.2 Associative Processing for Flow Control 


The conventional application of associative processing [9] has been 
the searching of sets of data to detect records that contain a field 
matching a specified key value. Associative mechanisms have been 
applied to database operations for sorting and searching [10], to cache 
memories [11, 12] for fast instruction fetching and data access, and to 
translation lookaside buffers [13] for rapid virtual memory mapping. 
The associative template mechanism applies associative processing [9 -
13] to program execution flow control. Instead of managing program
data, as in the previously cited applications, associative templates
use associative methods to directly modify the control state of
an executing data flow computer.


Associative techniques are used to manage data flow control 
because the control state of the system is distributed among the tem- 
plates: the generation of a new result value can affect elements of the 
control state in different parts of the system. In a token driven sys- 
tem, each portion of the control state change invoked by an executing 
template is performed as a distinct token handling operation. By 
employing broadcast communication techniques, the knowledge of a 
template firing event can be distributed to all necessary parts of the 
control state at the same time. Templates recognize relevant broad- 
cast events and adjust their own part of the program control state. 
This includes both reporting availability of new operands to recipient 
templates (templates for which result values are destined), and updat- 
ing the source templates (templates from which argument values are 
derived) with the acknowledge status of their recipients. All of these
actions can be done simultaneously, but require memories configured
with internal logic for the purpose.


2.3 Associative Diffusion for Communication


The initial associative template architecture was developed for a
system consisting of a single processing element, or node. The
architecture exploits parallelism to sustain peak performance of a
pipelined functional arithmetic unit. The concept of associative
diffusion was devised to extend associative mechanisms to the
important but still restrictive case of communication between adjacent
nodes of a multi-node associative template system. Unfortunately,
the cost in transistors and package pin count of a direct application of
the associative diffusion concept to the implementation of a static data
flow computer incorporating associative templates would be pro-
hibitive.


Figure 2. Domains of associativity. The gray node monitors the template
activity of the two black nodes and the white node in the gray circle.
The white node references values produced by nodes in the outline
circle. The domains for the black nodes are [...] shown.


Associative diffusion extends the domain of associativity of a 
node across node boundaries to encompass the operation space of its 
nearest neighbors. Every template in a node monitors the operation of 
the other templates contained within its local node and in its adjacent 
nodes (see Figure 2), and watches the result values produced by tem- 
plates of its own and its neighboring nodes. In this way, a template 
can reference result values of other nodes and determine when they 
become available. Templates can reference arguments across node 
boundaries and respond to requests for result values from other 
nodes. A template monitoring argument references from executing 
templates of its neighboring nodes as well as those of its own node 
can ascertain the acknowledge status (for data flow synchronization) 
of recipient templates in its local node and in the neighboring nodes. 


Associative diffusion supports nearest neighbor communication 
that is as fast as communication between templates in the same node. 
Since that efficiency does not extend beyond adjacent nodes, it best 
suits applications whose locality properties can be easily mapped onto 
mesh-like system organizations. Although this is somewhat restric- 
tive, many important classes of scientific computation exhibit such 
behavior. Even for applications where some longer distance transac- 
tions are required, multiple hop transfers can be supported within the 
scope of this mechanism. Doing so, however, will impact latency 
time and degrade system performance to a degree proportional to the 
average communication distance. 


2.4 Advantages and Disadvantages 


The associative template mechanism circumvents much of the bur- 
densome overhead in time and space imposed by the token mechanism. 
Some of the specific advantages include reduced memory bandwidth, 
higher throughput, free acknowledgments, smaller memory, and queue 
elimination. 


With these gains comes the inconvenience of requiring custom inte- 
grated circuits for all of the principal elements. The template storage 
module, which for a token driven system is conventional RAM, must 
be implemented as a very smart memory. The templates require at 
least six comparators each, along with other custom logic, even in a 
single node system. 


This paper presents an architecture that, while employing the asso-
ciative template and associative diffusion concepts to eliminate the
need for token mechanisms, assumes a structure that is within the capa-
bility of current engineering practices. A direct application of the
previously described mechanisms to a system architecture presents
serious difficulties in implementation. The number of transistors and
pins required for system node-to-node interfacing would appear to be
prohibitive. Additional problems of bus loading and power consumption
strain the system’s feasibility. The machine described below captures
the strengths of the associative diffusion mechanism without the
expected prohibitive costs. The result is a practical architecture for
tokenless static data flow computation.


3. A Single Node Architecture 


A simple, single-node associative template 
data flow machine has one high throughput, 
pipelined functional computation unit (FCU), functional in the sense 
that no internal state is kept between operations. The FCU accepts 
operation packets containing an opcode, the necessary operand values, 
and a result address, and produces packets containing the value result- 
ing from the computation performed and the result address. The FCU 
is driven by the graph coordinator, which stores and manages execu- 
tion of the static data flow program. The graph coordinator embodies 
the associative template techniques. It fires a new template every 
cycle, delivering an operation packet to the FCU for processing. At 
the same time, it assimilates a result from the FCU every cycle and 
updates the control state of all recipient templates. 


The graph coordinator is a collection of associative templates and 
the logic that controls their execution (dispatching logic). A tem- 
plate for a dyadic operator has five entries: a status field, two 
operand source addresses, an opcode, and a result value buffer. It 
receives results computed by the FCU on the result channel, and pass- 
es more work to the FCU on the operation channel (see Figures 3 and 
4). When a template fires, its address and the contents of its opcode 
field are written to the operation channel, and its operand address 
fields cause the source templates to write their result value fields to 
the operation channel as shown in Figure 4. 


A section of logic having access to every template’s status flags,
called the dispatching logic, determines when a template can fire. A
template that is eligible to fire, called pending, has its A and B flags
set, indicating that its operands have been computed and are available
for access; and has its acknowledge flag set, indicating that its current
result value is no longer needed by its recipients. The dispatching log-
ic will select one of the pending templates for execution.


Figure 3. An associative template. The triangles are comparators that monitor the various busses for
addresses that match the template’s operand address fields. A result address match will set the A or B
operand arrival flag. An operand bus address match will cause the template to generate an acknowledge sta-
tus signal that will be sampled by the operand’s source template. When the template fires, it places its opcode
onto the operation channel and its operand addresses onto the operand address busses. Later, it receives
its new value on the result channel. When it is a source, it places its result value onto the operation channel.


The A or B flag of a template is set when the result channel 
address matches the template’s A or B address field, respectively, 
which occurs when the FCU produces a needed operand value and 
returns it to the graph coordinator. The acknowledge flag, which per- 
forms the static data flow acknowledge synchronization function, 
stores the negated current value of the wired-OR acknowledgment sig- 
nal whenever the template’s result value is accessed by a recipient. 
The acknowledgment signal is asserted by other recipients of the tem- 
plate’s result value that have not yet fired, signifying that the result 
value is still needed for future computation. These recipients are iden- 
tified by matching the contents of either of their source address fields 
with the contents of either of the operation channel’s operand address 
busses. 


The graph coordinator accepts a result value and produces an opera-
tion packet during each cycle of its interface. During the cycle, all
templates that are recipients of the computed result update their con-
trol state simultaneously to reflect its availability. At the same time
that the operation packet is being assembled, the acknowledge status 
for both referenced operand source templates is derived. This wealth 
of very low level parallelism in the operation of the associative tem- 
plate mechanisms is responsible for its high throughput and interface 
bandwidth efficiency. 
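

The firing and acknowledge rules of this section can be illustrated with a small sequential simulation. It is a sketch under stated assumptions, not the hardware described here: the names (`Template`, `constant`, `step`, `run`) are hypothetical, the dispatcher simply picks the lowest-numbered pending template, constants are modelled as templates that never fire, a template with no recipients keeps its acknowledge flag set, and everything that the hardware does in parallel within one cycle is serialized.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Template:
    opcode: Callable[[float, float], float]  # operation carried out by the FCU
    src_a: int                               # address of the A source template
    src_b: int                               # address of the B source template
    a_ready: bool = False                    # A operand arrival flag
    b_ready: bool = False                    # B operand arrival flag
    ack: bool = True                         # current result no longer needed
    result: Optional[float] = None           # result value buffer


def constant(value: float) -> Template:
    """Assumed helper: a preloaded value that never fires (its flags stay clear)."""
    return Template(lambda a, b: value, src_a=-1, src_b=-1, result=value)


def still_needed(templates: List[Template], source: int, firing: int) -> bool:
    """Wired-OR acknowledgment signal: another recipient of `source` has
    received the operand but has not yet fired."""
    return any(
        (t.src_a == source and t.a_ready) or (t.src_b == source and t.b_ready)
        for i, t in enumerate(templates)
        if i != firing
    )


def step(templates: List[Template]) -> Optional[int]:
    """Fire one pending template; return its address, or None if none is pending."""
    pending = [i for i, t in enumerate(templates)
               if t.a_ready and t.b_ready and t.ack]
    if not pending:
        return None
    i = pending[0]                 # assumed dispatch policy: lowest address first
    t = templates[i]
    a, b = templates[t.src_a].result, templates[t.src_b].result
    t.a_ready = t.b_ready = False  # operands are consumed by the firing
    for s in {t.src_a, t.src_b}:   # sources latch the negated wired-OR signal
        templates[s].ack = not still_needed(templates, s, i)
    t.result = t.opcode(a, b)      # FCU computes; result returns with address i
    has_recipient = False
    for r in templates:            # recipients match the result address
        if r.src_a == i:
            r.a_ready, has_recipient = True, True
        if r.src_b == i:
            r.b_ready, has_recipient = True, True
    t.ack = not has_recipient      # a fresh result is needed until consumed
    return i


def run(templates: List[Template]) -> List[int]:
    order = []
    while True:
        fired = step(templates)
        if fired is None:
            return order
        order.append(fired)


program = [
    constant(3.0),                                   # template 0
    constant(4.0),                                   # template 1
    Template(lambda a, b: a + b, src_a=0, src_b=1),  # template 2 = t0 + t1
    Template(lambda a, b: a * b, src_a=2, src_b=0),  # template 3 = t2 * t0
]
# Load-time state: operands supplied by the constants are already available.
program[2].a_ready = program[2].b_ready = True
program[3].b_ready = True
order = run(program)
```

In this run template 2 fires first. While it reads template 0, template 3 still holds an unused operand from template 0, so template 0's acknowledge flag is cleared; it is set again only when template 3, the last recipient, fires.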


4. The ATD Architecture 


The ATD architecture is a practical static
data flow architecture embodying the associa-
tive template and associative diffusion
concepts. This section describes the ATD
architecture in detail, building on the back-
ground material and presentation of the
simple single node system.


4.1 System Level Architecture 


The ATD architecture extends the single node architecture to support
multiple, interconnected nodes while retaining the efficiencies derived
from its associative properties. Very tight coupling of the graph
coordinator and the FCU is essential for high performance. The
associative domain is augmented to include the nearest neighbors
without an overwhelming increase in cost or complexity.


Figure 4. A template firing. When a template’s operands have become available and its current result value is
no longer needed, it can fire when selected by the dispatching logic. The firing sequence causes the opcode
and the master template’s address to be placed directly onto the operation channel. The A and B operand
addresses cause the source templates to also place their values onto the operation channel. The completed
operation packet is submitted to the FCU.


Figure 5. Mesh topologies. Nodes of degree 3 can be used to create mesh
structures. In a) degree 3 nodes are combined to form a degree 4 mesh build-
ing block. In b) the nodes form a degree 6 building block.


Some simplifying assumptions are imposed on this architecture to 
reduce complexity of its nodes while assuring system scalability. The 
system level architecture is chosen to be a large mesh [14] with effi- 
cient adjacent node communication. Longer distance transactions may 
experience proportionally longer latencies and performance degrada- 
tion. Exploiting locality to eliminate or severely limit long distance 
transactions opens the way to vast systems comprising millions of 
nodes without a reduction in the average useful throughput per node, 
assuming sufficient problem size and node reliability. Many problems 
of interest, such as systolic algorithms [15], finite element methods 
[16, 17], and signal processing applications [18] have communication 
patterns that can be statically mapped onto such a structure. 


The mesh system organization can be realized with nodes of only 
degree 3 (nodes with three external interface ports). The relatively 
small number of ports for each node is important in constraining the 
module’s complexity while achieving the essence of the associative dif- 
fusion mechanism. Two examples of mesh topologies that can be 
supported with degree three nodes are shown in Figure 5. Cube-con- 
nected cycles [19] can also be implemented with degree three nodes. 


The system level architecture employs a non-global address space.
This is acceptable under the assumption that almost all references are
either intra-nodal or to adjacent nodes. For those cases where multiple
hop communication is necessary, forwarding templates can be
employed at the proportional cost of template storage space and node
throughput.


4.2 Node Architecture 


The architecture of the ATD node comprises a small number of
specially devised elements that implement the associative template
and associative diffusion mechanisms. These elements operate in a
manner similar to that of the simplistic architecture described earlier,
but also support communication with three nearest neighbors. The
elements are 1) the inter-node interface, 2) the data store, 3) the
functional computation unit, 4) the operation packet builder, 5) the
graph coordinator, and 6) the opcode store. The structure of a node is
shown in Figure 6.


Every node has a functional computation unit that is only responsible
for processing operation packets generated within the node. A dominant
specification of the architecture is to sustain peak throughput of this
unit. Its design is matched to the performance of other units that
determine the rate of operation packet generation.


The fields comprising a data flow template are distributed among
three of the node elements. The parts of the template that specify the
flow graph topology and program control state, namely the operand
addresses and the status flags, are stored in the graph coordinator. The
operation code for each template is located in the node’s opcode store.
The result values produced by the templates are held in the node’s
multiport data memory, the data store.


The opcode store is a simple small memory with the number of
entries equal to the number of templates managed by the node and
wide enough to hold the number of bits necessary to distinguish
among the set of operations performed by the functional computation
unit.


The data store stores the results computed by the FCU and pro-
vides the result values to the operation packet builder of the local and
adjacent nodes upon request. It contains two two-port memories, each
with separate read and write ports. The duplicate memories increase
the throughput of the data store. The result values of the functional
computation unit are stored in both memories simultaneously via their
write ports. Each data memory services four sources of read requests,
one from its own node and one from each of the connected neighboring
nodes. These requests come from the graph coordinators of each of the
nodes. The data values are returned to the operation packet builder of
the node originating the access.


The operation packet builder constructs operation packets for
delivery to the FCU. It acquires the source template address directly
from the graph coordinator. The opcode of the operation to be per-
formed is provided by the opcode store. The argument values come
from the data memories of either the host node or the neighboring
node holding the result value.


The graph coordinator stores the data flow graph topology, main-
tains the program control state, and determines the order in which the
program templates fire. It receives addresses of fired templates from
the functional computation units of its own and its three neighboring
nodes. It also gets acknowledge synchronization signals from the
graph coordinators of all four nodes. The means by which these two
classes of event information are used to update the program control
state approximates the domain of associative diffusion for the host
node and its three adjacent neighbors. The graph coordinator generates
a new three-tuple each cycle specifying the address of the executing
template and the addresses of the source templates for its arguments.
For each of the two argument template addresses, the graph coordina-
tor provides a single bit acknowledge signal indicating whether or not
it contains other templates that still require that operand to be avail-
able. A key architectural objective is to maximize graph coordinator
utilization.


Figure 6. A node of the ATD architecture. Additional communication channels exchange template addresses and data
values with neighboring nodes.


Figure 7. A node and its inter-node connection to one neighbor. Each
neighbor connects in the same manner, with the full complement of signals
being exchanged.


4.3 The Inter-Node Interface 


Adjacent nodes are connected by means of a symmetrical interface
as shown in Figure 7. There are four groups of interface signals
between two neighbor nodes, A and B. The first group supports data
driven synchronization. The result addresses of operations performed by
the functional computation unit of one node are sent to the result bus
of the other node’s graph coordinator, indicating the availability of
the result values for the identified templates.


The next two groups support 
data access service requests from the 
nodes’ data memories. The service 
requests originate with the node’s 
graph coordinator. The data values 
are returned to the requesting node’s 
operation packet builder. 




The last group of interface sig- 
nals provides acknowledge 
synchronization information 
between adjacent nodes. Again, two 
complete paths are provided in each 
direction. The acknowledge infor- 
mation consists of the acknowledge 
condition state and the template 
receiving the acknowledge signal. This information comes from the
operation channel of the sending node’s graph coordinator and is des- 
tined for one of two of the acknowledge ports of the receiving node’s 
graph coordinator. This information is used to set the state of the 
acknowledge flags for the selected template. 


4.4 The Data Store 


The data store (see Figure 8) resolves operand addresses into 
operand values. It accepts addresses from each domain member (node), 
and can return values to any member. The particular method used to 
accomplish the diffusion of result values across the domain is embod- 
ied in the choice of implementation of the data store. One 
configuration is described here. 


The configuration shown in Figure 8 has two banks. Results of 
computation are stored in the data memory, duplicated in each bank. 
This structure, allowing two operands to be resolved simultaneously, 
increases the availability of the operand data. 


An address register latches the incoming operand address until it
can be resolved by the data memory. Arbitration logic selects one of
the waiting addresses, which is then used to retrieve the operand value
from the data memory. The selected address, accompanied by the
acknowledge status, is also supplied to the graph coordinator.
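

A behavioral sketch of this configuration may help fix the ideas. It is not the circuit of Figure 8: the class and method names are hypothetical, each bank is given one address register per requesting node, and the arbitration policy (lowest-numbered node first) is an assumption, since the text does not specify one.

```python
from typing import List, Optional, Tuple

# (requesting node, address, value, acknowledge status)
Served = Optional[Tuple[int, int, Optional[float], bool]]


class DataStore:
    def __init__(self, size: int, domain_nodes: int = 4) -> None:
        # Two banks holding duplicate copies of the result values.
        self.banks: List[List[Optional[float]]] = [[None] * size for _ in range(2)]
        # One address register per (bank, requesting node); None means free.
        self.addr_regs: List[List[Optional[Tuple[int, bool]]]] = [
            [None] * domain_nodes for _ in range(2)
        ]

    def store_result(self, addr: int, value: float) -> None:
        """A result arriving from the FCU is written to both banks at once."""
        for bank in self.banks:
            bank[addr] = value

    def request(self, bank: int, node: int, addr: int, ack_status: bool) -> bool:
        """Latch an operand address; return False if the register is occupied."""
        if self.addr_regs[bank][node] is not None:
            return False
        self.addr_regs[bank][node] = (addr, ack_status)
        return True

    def cycle(self) -> List[Served]:
        """Resolve at most one waiting address per bank.

        The value goes back to the requesting node's operation packet builder;
        the address and acknowledge status go to the graph coordinator.
        """
        served: List[Served] = []
        for bank in range(2):
            choice: Served = None
            for node, entry in enumerate(self.addr_regs[bank]):
                if entry is not None:          # assumed fixed-priority arbitration
                    addr, ack = entry
                    choice = (node, addr, self.banks[bank][addr], ack)
                    self.addr_regs[bank][node] = None
                    break
            served.append(choice)
        return served


store = DataStore(size=8)
store.store_result(5, 2.5)
store.request(bank=0, node=0, addr=5, ack_status=True)
store.request(bank=0, node=2, addr=5, ack_status=False)
store.request(bank=1, node=1, addr=5, ack_status=True)
first = store.cycle()
second = store.cycle()
```

Because both banks hold the same value, two operand reads are resolved in the first cycle, one per bank; the second request to bank 0 waits in its address register until the following cycle.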


4.5 The Operation Packet Builder 


Because the delay between the initiation of the execution sequence
of a template (when the graph coordinator selects an eligible template
from those that are pending) and the arrival of its operand values can
vary, and since there are multiple (in this case, four) sources of
operand values, greater throughput can be achieved by allowing the
template firing logic to start building operation packets for several
templates at once. The short latency incurred by the operation packet
builder does not degrade system performance as long as there are
enough templates continuously ready to keep the FCU pipeline full.
This creates the need for a small buffer area where the opcode and
result address are temporarily stored while the operand addresses are
being resolved. When the operand data values are obtained, the com-
pleted packet is submitted to the FCU. The operation packet builder
(see Figure 9) accomplishes those tasks.


Figure 8. The data store. The data store has two data memories containing duplicate sets of result values, and two access
ports, one to each memory, for each node in the domain. As operands are fetched, the accompanying acknowledge status
is forwarded to the graph coordinator. New result values are stored as they arrive from the FCU.


The operation packet builder accepts an opcode, a result address,
and, for each operand, a domain/bank address (three bits specifying the
neighbor and which of the two ports). These it places in one of its
buffers. The domain/bank address is used to route the arriving operand
value to the proper operation packet buffer field. Once an operand val-
ue has arrived and has been stored, the operand address register used
for the access is free again to be used by another firing template. The
status of the operand address registers is used by the graph coordina-
tor, which chooses a pending template that can use currently available
registers.
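

The routing role of the domain/bank address can be sketched as follows. The bit layout (two high bits for the domain node, one low bit for the bank) and all names are assumptions made for the example; the text states only that three bits identify the neighbor and the port.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def encode_route(node: int, bank: int) -> int:
    """Pack a domain node (0-3) and a bank (0-1) into a 3-bit route.

    Assumed layout: node in the two high bits, bank in the low bit.
    """
    if not (0 <= node < 4 and bank in (0, 1)):
        raise ValueError("node must be 0..3 and bank must be 0 or 1")
    return (node << 1) | bank


def decode_route(route: int) -> Tuple[int, int]:
    """Inverse of encode_route: return (node, bank)."""
    return route >> 1, route & 1


@dataclass
class PacketBuffer:
    opcode: str
    result_addr: int
    route_a: int                      # where the A operand will come from
    route_b: int                      # where the B operand will come from
    value_a: Optional[float] = None
    value_b: Optional[float] = None

    def deliver(self, route: int, value: float) -> bool:
        """Store an arriving operand in the field its route selects.

        Returns True once both operand fields are filled, i.e. when the
        packet can be submitted to the FCU.
        """
        if route == self.route_a and self.value_a is None:
            self.value_a = value
        elif route == self.route_b and self.value_b is None:
            self.value_b = value
        else:
            raise ValueError("operand arrived on an unexpected route")
        return self.value_a is not None and self.value_b is not None


buf = PacketBuffer("add", result_addr=7,
                   route_a=encode_route(0, 0), route_b=encode_route(2, 1))
done_after_b = buf.deliver(encode_route(2, 1), 4.0)   # B arrives first
done_after_a = buf.deliver(encode_route(0, 0), 3.0)   # A completes the packet
```

The operands may arrive in either order; the packet is complete only after both have been stored.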


4.6 The Functional Computation Unit 


Operations on the operand data values, which include arithmetic 
and Boolean manipulations, are performed by the functional computa- 
tion unit (FCU). The FCU accepts a stream of operation packets from 
the operation packet builder, each containing an opcode, two operand
values, and a result address (see Figure 10). Since the FCU is purely 
functional, the result of any operation will be the same independent of 
the ordering of the arrival of operation packets. The actual internal 
FCU architecture is not discussed here, but can be assumed to employ 
pipelining to increase its throughput, and one or more VLSI floating 
point or special purpose functional units. 


4.7 The Graph Coordinator 


The ATD graph coordinator extends the simple single node archi-
tecture presented in Section 3 to provide associative diffusion for
associative template operation between adjacent nodes. It provides
data driven synchronization from neighbor nodes, acknowledgment
reporting by neighbors, and template firing based on available
resources. These extensions are presented in this section. The new
graph coordinator architecture is designed to yield chip, pin, and device
counts considered practical by standards of contemporary technology.


Figure 9. The operation packet builder. The
operation packet builder begins building a packet
when it receives the domain and bank addresses
of the operands and the result template address.
It will later receive the opcode for the packet and
the two operand values, which may arrive at sepa-
rate times. When all the fields have been filled,
the completed packet is sent to the FCU.


4.7.1 Result Value Handling 


As previously indicated, result values are no longer stored in the 
graph coordinator. Instead, a separate dedicated multiport data store 
is provided. This transfer of functionality reduces the requirements 
imposed on the graph coordinator, drastically lowering its interface 
pin count, permitting more uniform chip layout, and concentrating 
available on-chip devices on the task of flow control. While the 
result data of the FCUs are not applied to the graph coordinator, the 
result addresses that identify the source templates of the operations 
are still supplied. This is necessary for the unit to synchronize on 
completed operations and update its control state. Furthermore, for
the distributed architecture (as opposed to the single node system),
result addresses from adjacent nodes as well as those from the host
node’s FCU must be monitored associatively.


In the single node system, each template argument source address
field employs a field-wide comparator to monitor the result address
bus. Associative diffusion requires that the result addresses of all
four nodes in an associative domain be equally accessible, and thus four
result address busses, one from each node’s functional computation
unit, are used with the ATD graph coordinator. Address widths are
expected to be in the range of twelve bits, which would require forty-
eight input pins. A few additional pins are needed for conditional tem-
plates and timing.


Figure 10. The functional computation unit. The functional computation
unit receives a stream of operation packets and produces a stream of result val-
ues accompanied by their destination addresses.


The direct realization of associative diffusion would imply that 
the single source address field comparator should be expanded to four 
comparators, one to monitor each of the four result address busses. 
Fortunately, the computing model is constrained, so this is not neces- 
sary. Because of the static allocation of data flow templates, the 
argument template referenced by a source address field can come from 
only one of the four domain nodes. Therefore each source address field 
need monitor only one of the four result address busses. The bus to be 
monitored is specified by the two most significant bits of the tem- 
plate address. Instead of adding three more comparators, the graph 
coordinator is augmented with a 4 to 1 multiplexer with input selec- 
tion controlled by the two most significant address bits (the domain 
bits), as shown in Figure 11. This approach is far less expensive than 
the four comparator approach in terms of both transistor count and 
power consumption. 
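

The multiplexer arrangement can be expressed as a short behavioral sketch. The widths are assumptions: each result bus is taken to carry a twelve-bit within-node address, and a source address field is taken to hold two domain bits above those twelve bits. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

LOCAL_BITS = 12        # assumed width of the within-node template address
DOMAIN_NODES = 4       # host node plus three neighbors

# Four twelve-bit result address busses account for the forty-eight input pins.
RESULT_BUS_PINS = DOMAIN_NODES * LOCAL_BITS


def split_address(addr: int) -> Tuple[int, int]:
    """Return (domain bits, local address) of a source address field."""
    return addr >> LOCAL_BITS, addr & ((1 << LOCAL_BITS) - 1)


def operand_arrived(source_addr: int, result_busses: List[Optional[int]]) -> bool:
    """One comparator behind a 4-to-1 multiplexer.

    `result_busses[n]` is the local address driven by node n's FCU in this
    cycle, or None if that bus is idle. Only the bus chosen by the domain
    bits is compared; the other three are ignored.
    """
    domain, local = split_address(source_addr)
    selected = result_busses[domain]            # the 4-to-1 multiplexer
    return selected is not None and selected == local   # the single comparator


source = (2 << LOCAL_BITS) | 0x0A5              # template 0x0A5 in domain node 2
hit = operand_arrived(source, [0x0A5, None, 0x0A5, 0x123])
miss = operand_arrived(source, [0x0A5, None, 0x111, None])
```

In the second call, bus 0 carries the same local address, but it is not the bus selected by the domain bits, so no match is reported.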


4.7.2 Synchronization by Acknowledgment 


The single node associative template method of synchronizing
with recipient (children) templates, referred to here as acknowledge
synchronization, can be thought of as consisting of two parts: generat-
ing the acknowledge condition state, and recording the current state in
the appropriate template’s acknowledge flag. When all templates
were in the same node, both parts could be performed simultaneously.
In the ATD machine, recipient templates may be located in any one of
a template’s domain nodes. The direct method of implementing asso-
ciative diffusion would be to tie all nodes of a domain into one large
node. To maintain parallelism, there would have to be four times as
many operation channel address busses and as many additional compara-
tors per source address field. The costs of such a structure are clearly
prohibitive.


Figure 11. An ATD associative template. The acknowl-
edge flag of the single node system has been replaced by
an expected flag and a received flag for each neighbor.


The ATD architecture approximates this structure by separately 
handling the acknowledge synchronization of a node for each of the 
nodes in its domain. When a node creates an acknowledge condition 
signal, it needs only to go to one of its four domain nodes, that being 
the same node to which the argument data request is directed. Instead 
of broadcasting the condition signal across the entire domain, it is sent 
only to the node in which the argument template resides. All tem- 
plates in one node that use the results of a template in another node of 
their domain participate associatively as in the single node architecture 
to determine whether all of them are done with that operand. An 
acknowledge condition signal received from another node indicates 
whether the entire node is finished with the operand. Thus the first 
part, that of creating an acknowledge condition signal for a template, 
is done on a per node basis. 


The second part, that of recording the acknowledge condition 
state, is facilitated by replacing the original acknowledge flag with 
four flags and four mask bits (see Figure 12). Each flag reflects the 
acknowledge condition state of one of the four nodes in the domain for 
the template’s result value. If a particular neighboring node contains 
no resident templates that use the result of a host’s template, then the 
mask bit corresponding to that neighbor node is set. A template’s 
acknowledge status is satisfied when either the flag or its mask bit is 
set for all of the domain nodes. 
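
The flag-and-mask rule above can be illustrated with a minimal Python sketch. The class name `AckState`, its methods, and the node numbering are hypothetical and chosen only for illustration; the hardware evaluates this condition with combinational logic, not software.

```python
from dataclasses import dataclass, field

DOMAIN_SIZE = 4  # the text describes four nodes per domain


@dataclass
class AckState:
    """Hypothetical model of one template's acknowledge flags and mask bits."""
    flags: list = field(default_factory=lambda: [False] * DOMAIN_SIZE)
    masks: list = field(default_factory=lambda: [False] * DOMAIN_SIZE)

    def record(self, node, condition):
        # Load the acknowledge condition state reported by one domain node.
        self.flags[node] = condition

    def satisfied(self):
        # Satisfied when, for every domain node, the flag or its mask bit is set.
        return all(f or m for f, m in zip(self.flags, self.masks))


# Nodes 2 and 3 hold no recipients of this template, so their mask bits are set.
ack = AckState(masks=[False, False, True, True])
ack.record(0, True)   # local node is finished with the result
ack.record(1, True)   # neighbor node is finished with the result
print(ack.satisfied())
```

Masking the nodes that hold no recipients means the template never waits for an acknowledge that cannot arrive.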


To set the flags, the graph coordinator is augmented with two
acknowledge select busses. Each of these busses can independently
choose a template and load the state of one of the acknowledge flags.
One of these busses is associated with each of the node’s two data
store memories, and the current acknowledge status is stored in the
appropriate flag at the same time that the data value for that template
is being fetched from the data memory. The additional acknowledge
busses increase the pin count by twenty-eight. Since these are common
select busses, no additional comparators are required to extend the
utility of the graph coordinator from single node operation to
associative diffusion emulation.


Figure 12. Firing logic. The dispatching logic uses the acknowledge
flags and the operand available flags to determine whether a template
is eligible for firing.


4.7.3 Dispatching 


The dispatching logic in the single node architecture served the 
simple task of choosing almost arbitrarily among the pending templates
ready for execution. The ATD architecture relies heavily on the
dispatching logic for a second critical function, that of resource man- 
agement. The potential for bottlenecks exists in the ATD architecture 
because of the possibility of contention for access to the data memo- 
ries shared among the four nodes of a domain. It is the job of the 
dispatching logic to prevent such contention from degrading the 
throughput of the graph coordinator and indirectly the throughput of 
the FCU.


A node has two interface ports to the data memory of each of its 
domain’s nodes. Each port can only support one data access request 
from a memory at a time. The dispatching logic receives empty/full 
signals from all of the ports. The set of full ports restricts the cate- 
gories of templates (based on the source nodes of the arguments) from 
which the next one to fire may be chosen. The dispatching logic con- 
tinues to select templates as long as ports are available to carry out 
the argument access requests, and as long as templates that make use 
of available data store ports are pending. 
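
The port-limited selection described above can be sketched in Python as follows. This is a simplified software model under stated assumptions: the function `dispatch`, the tuple layout of a pending template, and the scan order are hypothetical, and the real dispatching logic is hardware whose selection policy is not specified here.

```python
def dispatch(pending, free_ports):
    """Select pending templates while empty data-store ports remain.

    pending    -- list of (template_id, a_source_node, b_source_node)
    free_ports -- dict mapping a domain node to its number of empty ports
    """
    free = dict(free_ports)  # work on a copy of the empty/full state
    fired = []
    for template_id, a_node, b_node in pending:
        # Count the ports this template needs on each source node.
        need = {}
        for node in (a_node, b_node):
            need[node] = need.get(node, 0) + 1
        if all(free.get(node, 0) >= count for node, count in need.items()):
            for node, count in need.items():
                free[node] -= count  # these ports are now full
            fired.append(template_id)
    return fired


# Two ports to the data memory of each of the four domain nodes.
ports = {0: 2, 1: 2, 2: 2, 3: 2}
ready = [("t1", 0, 1), ("t2", 0, 2), ("t3", 0, 3)]
print(dispatch(ready, ports))
```

In this example the third template is held back because both ports to node 0 are already in use, which is the restriction the full-port set imposes on the next template to fire.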


4.8 Template Execution 


To summarize the operation of the ATD architecture the execution 
cycle of an associative template is examined. Assume this template 
gets its A operand from a local source template and its B operand 
from a source template in a neighboring node (Figure 13). Also 
assume that its result values are used by two local recipient templates 
and one recipient template in a neighboring node. 


The cycle begins with the template ready to fire. The dispatching
logic of the graph coordinator selects the pending template for
execution when a port to the local data memory and a port to the node
containing its B operand are available, then sends the address of the
firing template to the opcode store and to the operation packet builder.
The firing template asserts the contents of its two source argument
address fields on the A and B operand address busses within the graph
coordinator. The A operand address is applied to the access port of
one of the two data memories in the local node. The B operand
address is applied to the access port of one of the two data memories
in the neighbor node containing the B source template.


An acknowledge condition state signal accompanies each of the 
two operand access service requests. Other templates in the local node 
monitor the operation channel and, if either of their argument source 
template fields match either of the source template addresses on the 
operation channel, they assert an active signal on the appropriate wired- 
OR acknowledge signal line indicating that they still require the 
operand to be available. Otherwise, the templates output an inactive 
signal on the acknowledge lines. These signals tell the source tem- 
plates whether or not there are other templates in that node for which 
the result value must remain available. The firing template’s parame- 
ters are distributed to the designated data stores, opcode store, and 
packet builder, and the acknowledge status for each of the arguments 
is produced. The argument and acknowledge flags of the firing tem- 
plate are then reset. 


The operation packet builder chooses a free operation packet buffer 
and records from which data memory output ports the two argument 
values are to come. It immediately stores the firing template’s identi- 
fying address, which is also applied to the opcode store. The 
operation’s opcode is provided during the second cycle and is loaded 
into the operation packet buffer. 


The data store for the neighbor node supplying the B operand arbi- 
trates in a round-robin fashion among the four access ports (from the 
adjacent nodes) it services. When it comes to the port for the firing 
template, the data memory reads the contents of its addressed value 
and returns it to the dedicated output buffer of the local node. This
output buffer directly feeds the local operation packet builder. At the 
same time, the acknowledge condition state accompanying the argu- 
ment value request is passed to the graph coordinator’s acknowledge 
port along with the source template address. The address selects the 
source template, and the acknowledge condition state is loaded into 
the acknowledge flag associated with the local node. 


The operation packet builder continuously acquires the contents of 
data memory output buffers to which values have been written and 
stores them in the appropriate fields of the designated operation packet 
buffers. When all components of the operation packet have been 
assembled in the buffer, the buffer’s ready flag is set. Shortly there- 
after, the functional computation unit detects the ready condition of 
the operation packet in the buffer and assimilates its contents. 


Figure 13. Template Execution. The firing template interacts with
other templates in its own node and in a neighboring node.


The functional computation unit processes the operation packet. 
After some number of cycles, due to the latency of the unit (which is
unspecified by the architecture), the result value of the operation and
the address of the template responsible for its creation are produced. 
The result value is immediately stored in both halves of the node’s 
data memory via their respective write ports. The constraints of the 
data flow model guarantee that no location of the data memory will 
be both written and read at the same time, so conflict cannot occur. 
The result address is distributed to the graph coordinator result busses 
of each of the nodes within the domain. 


The template source address fields monitor the result busses of 
the nodes from which their operand values are derived. The A source 
address field monitors the local result bus with its comparators con- 
nected to the bus through its multiplexor set by its address’ two most 
significant bits. The B source address field similarly monitors the 
result bus of the neighboring node containing the template referenced 
by the field. When these source templates fire (A locally and B in the 
neighboring node) the template determines the availability of their 
result values by detecting a match between the fields’ contents and 
those of the respective result busses. Upon this occasion, the appropri- 
ate argument flags, A or B, are set. 


As the template’s recipient templates fire, they access the node’s 
data memory for the template’s result value and return acknowledge 
condition state signals to the acknowledge ports of the graph coordina- 
tor. When the recipient template in the neighboring node fires, there 
are no other templates using the local template’s result value as 
operands, so the acknowledge condition state returned to the local 
graph coordinator causes the corresponding acknowledge condition flag 
to be set. When the first of the two local recipients fires, the 
acknowledge flag will remain clear because the second local recipient 
still requires the template’s result value to be available. Upon firing 
of this second local recipient, however, the local acknowledge flag is 
set because no other templates in the local node require the result val- 
ue to perform their own operations. 


Both the A and B flags are set indicating that both operands are 
available. The local acknowledge flag and that associated with one of 
the neighbor nodes (the one containing the recipient template) are set 
while the masks of the other two acknowledge flags are set because 
those adjacent nodes do not contain any recipients of the template. 
Under these conditions, the dispatch logic determines that the tem- 
plate is again ready to fire, thus completing the execution cycle. 


5. Conclusions 


A new architecture for static data flow computation has been pre- 
sented that employs associative mechanisms for program flow control 
and communication in lieu of more conventional token driven tech- 
niques. Tokens impose too much overhead for effective fine-grained 
parallel processing and are wasteful of memory bandwidth. It has 
been shown that by using associative techniques, associative templates 
support the semantics of static data flow more efficiently than do 
tokens. That concept alone, however, has been inadequate to formulate 
a complete distributed static data flow architecture. It has not provid- 
ed the means by which interaction among multiple nodes is conducted. 
A second concept, that of associative diffusion, had also been put for- 
ward to fill this gap. It proposed that the domains of associativity of 
nearest neighbor processing elements, or nodes, be overlapped so that 
the activities of one node could be directly monitored by its immediate 
neighbors. This provides the vehicle for extending the associative tem- 
plate mechanisms across node boundaries. While feasible methods of 
implementing a single node system with associative templates exist, 
the direct method of extending that architecture with associative diffu- 
sion requires a prohibitive amount of logic. The new architecture 
presented in this paper provides the first practical means without 
tokens of realizing an associative template static data flow computer
with an approximation of associative diffusion for synchronization and 
communication. 


The ATD architecture exhibits substantial promise for high perfor- 
mance parallel computing in general, and static data flow computation 
in particular. But a number of questions still remain to be investigat- 
ed before it can be proved worthy of implementation. One is the logic 
intensity of the control unit. The architecture requires an interface to 
this element that is entirely realizable. However, the circuitry 
required per template is substantial. Preliminary designs have estab- 
lished that approximately a thousand transistors are required per 
template in the control unit. While the layout structure is particularly
orderly, promising good utilization of chip real estate and simple
design, this is still a lot of logic. Current technology can thus produce
control units capable of containing about 256 such templates.
The ATD architecture supports connecting these units in groups of 
four to permit a thousand templates per node. Before judgement can 
be made regarding its acceptability, this cost must be weighed against 
alternate methods of applying that scale of logic to parallel comput- 
ing. 


A second challenge that must be satisfactorily met is the means 
by which programs are distributed among the nodes in the assumed 
mesh system level structure. While a number of classes of problems 
are known to be easily mapped onto such a structure to maximize near- 
est node communication, the degree to which intra-node program 
parallelism is needed to fill the latency cycles of the memory access 
paths must be studied and automated allocation techniques must be 
developed. 


Finally, a range of data memory/packet builder structures present 
themselves. How performance varies for these different structures 
with respect to real world application programs has yet to be under- 
stood and needs to be explored. At this point, the success of the ATD 
architecture is that it demonstrates a complete and viable alternate 
approach to the token mechanism for static data flow architecture and 
opens a new area of performance/cost trade-offs in the design of data 
flow computers. It is hoped that this work, as preliminary as it is, 
will inspire other researchers in this field to reexamine the data flow 
computing model in light of these new structures and to consider the 
potential of their advancement. 


Abstract -- The SAM architecture is an enhanced von Neumann
processor that contains inexpensive features for supporting data
flow style of parallelism. The architecture gets its name from
the basic instructions for supporting parallelism, Split and
Merge. It is shown that these instructions can be used to
implement the parallel structure of an arbitrary acyclic data flow
graph. Features for supporting dynamic parallelism and
multiple run-time environments are presented. Implementation
issues for supporting instruction execution and the handling of
faults and interrupts are also discussed.


1. Introduction. 


One of the main focuses of current research in computer
architecture is the design of hardware organizations that support
the parallel execution of instructions (see [1] for several
examples). Data flow parallel architectures continue to receive
a great deal of attention [3] [4]. In a data flow architecture an
instruction may execute as soon as its operands become
available, permitting a degree of parallelism bounded only by
the flow of data between instructions. In spite of their intuitive
appeal, data flow machines have been slow to reach the
marketplace, and it appears that much work must be done to
make data flow machines competitive with other parallel
architectures [5].


In spite of the objections raised by [5], data flow is appealing,
and it is reasonable to ask whether it can be adapted to a more
conventional architecture. The approach taken in this paper is
to start with a von Neumann processor, and by adding features,
enable it to execute programs in the highly parallel manner
characteristic of data flow machines. The objective is to
develop inexpensive parallel architectures that can exploit
parallelism without sacrificing compatibility with existing
software. Compatibility with existing software is important
because it represents an enormous investment for the computer
user, and it is necessary to preserve this investment. The
architecture presented in this paper is called the SAM
architecture for reasons that will be explained in section 2. The
features of this architecture are similar to those found in
multi-threading machines [2][6][7], but are somewhat simpler. In
spite of this, the features presented here can be used to program
some of the more complicated features found in other
machines.


Section 2 describes the architecture and the primitive features
for supporting parallelism. Section 3 shows how the
architecture supports arbitrarily complex static parallelism. 
Section 4 introduces features that support dynamic parallelism 
and multiple run-time environments. Section 5 discusses 
implementation issues, and section 6 draws conclusions. 


2. The Basic Architectural Features.


The SAM architecture supports arithmetic and logical
instructions as well as conditional and unconditional jumps.
Initially it is assumed that conditional jumps perform both a
comparison and a conditional jump, and that all instructions are
memory to memory. It will be possible to relax these
restrictions later. Two addressing modes are supported: the
full-address mode, which provides direct addressing using
full-width addresses, and the short-address mode, which requires
fewer address bits than the full-address mode and is used to
access the low-address portion of memory. Short addresses
may be either direct or indirect. The portion of memory
addressable in the short-address mode is called the
short-address space, and will be described more fully in section 4. In
some implementations, portions of the short-address space may
be mapped to registers or a high-speed cache. There are no
programmer-addressable registers.


* This research was supported by the University of South
Florida Center for Microelectronics Design and Test.


The SAM architecture is a MIMD machine that allows the
degree of parallelism to vary with time. Two types of
parallelism are supported: static parallelism, where the degree of
parallelism is determined at compile time, and dynamic
parallelism, where the degree of parallelism depends in part on
the data being processed. There is no upper limit on the degree
of parallelism. The features for supporting parallelism are
motivated by the differences between the execution histories of
sequential machines and those of data flow machines. On a
sequential machine each instruction has exactly one predecessor
and exactly one successor, while on a data-flow machine, each
instruction may have several predecessors and successors. In order
to support parallelism whose degree varies with time, it is
necessary to have instructions that have more than one
predecessor and successor. In the SAM architecture the "split"
instruction has one predecessor and two successors, while the
"merge" instruction has two predecessors and one successor.
These instructions form the core around which the rest of the
architecture is designed, hence the name "SAM" for "Split And
Merge." The split instruction has the format of an
unconditional jump; one successor is the branch target, while
the other is the following instruction. The split instruction
creates two independent instruction streams. The merge
instruction has one operand that is normally initialized to zero.
When its operand is zero, the merge instruction sets it to 1 and
terminates the execution of the current instruction stream.
Otherwise it sets its operand to zero and continues execution
with the next instruction. The merge instruction operates
atomically on its operand. Figure 1 shows how the
combination of split and merge can be used to evaluate the
statement "e=(a+b)+(c+d)" with the sub-expressions evaluated
in parallel.


1       split  L1
2       add    a,b,t1
3       merge  x
4       jump   L2
5  L1:  add    c,d,t2
6       merge  x
7  L2:  add    t1,t2,e


Figure 1. Parallel Evaluation of e=(a+b)+(c+d).


Most of the instructions in Figure 1 are self explanatory. The
labels "a," "b," "c," "d," and "e" are the variables named in the
expression, while the labels "t1" and "t2" are temporary
variables. The label "x" is a temporary variable that is
initialized to zero. The split instruction on line 1 causes the add
instructions on lines 2 and 5 to be executed in parallel. The
first two operands of these instructions are added and the result
is placed in the third operand. In this example, a separate
merge instruction is placed at the end of each instruction
stream. An equivalent way to program this example would be
to omit the merge instruction on line 3 and move the label "L2"
from line 7 to line 6.
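
The behavior of the Figure 1 program can be checked with a small Python interpreter. This is an illustrative sketch, not a description of the SAM hardware: the function `run`, the tuple encoding of instructions, and the round-robin scheduling of streams are all assumptions. Each interpreter step executes one whole instruction, which stands in for the atomicity of merge.

```python
def run(program, labels, memory):
    """Round-robin interpreter for a tiny split/merge instruction subset."""
    streams = [0]  # program counters of the live instruction streams
    while streams:
        pc = streams.pop(0)
        if pc >= len(program):
            continue  # this stream has run off the end of the program
        op, *args = program[pc]
        if op == "split":
            streams.append(labels[args[0]])  # new stream at the branch target
            streams.append(pc + 1)           # original stream continues
        elif op == "merge":
            x = args[0]
            if memory[x] == 0:
                memory[x] = 1                # first arrival: stream terminates
            else:
                memory[x] = 0                # second arrival: continue
                streams.append(pc + 1)
        elif op == "jump":
            streams.append(labels[args[0]])
        elif op == "add":
            a, b, dst = args
            memory[dst] = memory[a] + memory[b]
            streams.append(pc + 1)
    return memory


# The seven instructions of Figure 1, indexed from 0.
program = [
    ("split", "L1"),
    ("add", "a", "b", "t1"),
    ("merge", "x"),
    ("jump", "L2"),
    ("add", "c", "d", "t2"),
    ("merge", "x"),
    ("add", "t1", "t2", "e"),
]
labels = {"L1": 4, "L2": 6}
memory = {"a": 1, "b": 2, "c": 3, "d": 4, "t1": 0, "t2": 0, "e": 0, "x": 0}
result = run(program, labels, memory)
print(result["e"])
```

Whichever stream reaches its merge first is terminated, and the other stream performs the final addition, so the merge operand "x" returns to zero when the program finishes.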


Figure 1 shows that the split instruction adds three or four 
instructions of overhead to each stream (the two merge 
instructions cannot execute in parallel). If the end of both 
streams is moved to line 6, the overhead can be reduced to three 
instructions per stream. Assuming that all instructions execute 
in one time unit, each stream must be at least four instructions 
long for there to be any benefit from the parallelism introduced
by a split. However, if the split and merge instructions 
execute quickly as compared to the other instructions, the 
length of the stream could be reduced without negating the 
beneficial effects of parallelizing the code. 


At this point it is assumed that all instruction streams execute in 
the same environment, which restricts the way code can be 
parallelized. Methods for removing these restrictions will be 
discussed in section 4. 


3. Translating Data Flow Code to Split/Merge Streams. 


The translations presented in this section are based on the
intermediate form of data flow code presented in [8]. A
program is represented as a combinatorial expression of the
form (C op0 ... opn), where C is an Abdali combinator [9], and
op0 through opn are the operands of the combinator. An
operand may be a constant or another expression. The
combinator may be of the form $B^n_m$, $I^n_m$, or $K_m$. If it is of the
form $B^n_m$ then op0 will be the name of an instruction and op1
through opn will supply the operands of the instruction. If the
combinator is of the form $I^n_m$, op0 through opn will not be
present, and if it is of the form $K_m$, op0 will be a constant and
op1 through opn will not be present. For any expression, all
subscripts will have the same value. A subscript of m signifies
that m inputs are needed to evaluate the expression. Combinators
of the form $B^n_m$ are used to evaluate n-input functions, those of
the form $K_m$ are used to introduce constants into an
expression, while those of the form $I^n_m$ are used to select the
nth input from a list of m inputs. For example, the expression
x+y+1 can be translated into the expression
($B^2_2$ + ($B^2_2$ + ($I^1_2$) ($I^2_2$)) ($K_2$ 1)). As was pointed out in [8]
these expressions are simply linearized forms of data flow
graphs.


Because the language presented in [8] does not contain
conditionals, loops, or assignments, no provision was made for
handling them. In addition, because the language is applicative,
no provision was made for handling sets of independent
statements that communicate by side effects. To make the
results of this section as general as possible, it is necessary to
introduce the functions "if," "while," "assign," and "set" to
handle conditionals, loops, assignments, and sets of
independent statements. In addition, the new combinator $Q^n_m$ is
introduced to distinguish between sets of statements that are
independent, and those that have data dependencies. The
combinator $Q^n_m$ is mathematically equivalent to $B^n_m$, but when
code is generated for the expression ($B^n_m$ x0 x1 ... xn) each
of the expressions will be evaluated in parallel. When code is
generated for the expression ($Q^n_m$ x0 x1 ... xn), the
expressions x1 through xn will be evaluated serially. The $Q^n_m$
combinator can also be used near the "leaves" of an expression
if it is necessary to place a lower bound on the length of
independently executed instruction streams. To illustrate the
use of the $Q^n_m$ combinator, consider the usual form of the
combinatorial expression for e=(a+b)+(c+d) which is


($B^2_5$ assign ($I^5_5$) ($B^2_5$ + ($B^2_5$ + ($I^1_5$) ($I^2_5$)) ($B^2_5$ + ($I^3_5$) ($I^4_5$)))).
When code is generated for this expression, the code-generation algorithm
will create two independent instruction streams to evaluate the
sub-expressions (a+b) and (c+d) in parallel. The two streams
will be merged to complete the final addition. The following
slight modification in the combinatorial expression will cause
the three additions to be executed serially,
($B^2_5$ assign ($I^5_5$) ($Q^2_5$ + ($B^2_5$ + ($I^1_5$) ($I^2_5$)) ($B^2_5$ + ($I^3_5$) ($I^4_5$)))).


Code can be generated for combinator expressions in a
straightforward manner. A separate instruction stream is
created to evaluate each operand of a B-type combinator,
while the operands of a Q-type combinator
are evaluated serially. The operands of "if" and "while"
functions are always executed serially, although parallelism
within the evaluation of the operands is not precluded. When
code is generated for the body of a loop, it begins and ends as a
single instruction stream, which prevents the iterations of a
loop from getting out of sync.


If expressions of the form ($Q^n_m$ set x1 ... xn) are translated
using the most straightforward algorithm, parallelism will be
lost. For example, consider the following two-statement
sequence.


a=b+c 
e=(d+f)*a 


The subexpressions "b+c" and "d+f" can be executed in
parallel, but if the statements are serialized due to the data
dependency, this parallelism will be lost. A more sophisticated
method of translating these functions is needed. The procedure
is easier to visualize if it is assumed that the set of statements
has been described as a data flow graph. Each node in the
graph represents one statement in the set. (Complex
expressions have been broken into separate statements.) All
arcs that do not begin and end on a node are omitted, along
with all duplicate arcs. The result is a directed acyclic graph
with one or more source nodes and one or more sink nodes.
Assume that there are j source nodes and k sink nodes. Since
the source nodes use only those data items that are assumed to
be present before the execution of the set begins, they can all be
executed in parallel. The code for the set begins with a j-label
msplit instruction, which is the single predecessor of each
source node. Similarly, the code for the set ends with a k-label
mmerge instruction, which is the single successor of every sink
node. Msplit and mmerge are standardized sequences of
instructions that create and merge an arbitrary number of
streams. An n-label msplit acts as an n-way branch, while an
n-label mmerge acts as an n-label branch-target. Their
construction is straightforward. The code for a node with m
predecessors and n successors begins with an m-label mmerge
instruction and ends with an n-label msplit instruction. In each
case the labels on the msplit instruction match the labels on the
mmerge instruction of the successor nodes. The nodes of the
data flow graph can represent arbitrarily complex expressions
and are not restricted to individual instructions. A node may
have a high degree of internal parallelism as long as it has a
single entry and a single exit. This method of translating set
functions allows arbitrary acyclic data flow graphs to be
implemented using the split and merge instructions. An
example of this procedure is illustrated in Figure 2.


        msplit LA1,LB1
        mmerge LA1


        ...
        msplit LC1,LD1,LE1
        mmerge LB1


        mmerge LC1
        msplit LG1


Figure 2. Data Flow Parallelism with Split and Merge


4. Creating Multiple Environments.


Although a high degree of parallelism can be realized with the
split instruction in a single environment, multiple environments
are needed to support dynamic parallelism and shared
subroutines. One method of supporting multiple environments
would be to have several data and address registers that are
replicated for each instruction stream. Such a mechanism is
used in some multiprocessors, but since not all independent
instruction streams require separate environments, it is
desirable to separate the function of creating an instruction
stream from the function of creating a new environment. Recall
that the SAM architecture provides a short-addressing mode
that is used to access the short-address space. Associated with
each instruction stream is a register called the prefix register that
contains the location of the short-address space. The prefix
register is assumed to contain the high-order address bits of the
short-address space, with the low order bits being supplied by
the instruction. The instructions "readp" and "writep" are
provided to read and write the prefix register. Instruction
streams that require separate environments may use these
instructions to create new short-address spaces. The current
value of the prefix register is replicated on a split, which causes
the two independent streams to execute in the same
environment. Since the two streams will generally begin
execution at two different points in memory, distinct
environments can be created for each stream. The merge
instruction does not affect the contents of the prefix register.
Although the prefix register can be used to solve the problem of
dynamic parallelism and the problem of calling the same
subroutine in two different instruction streams, the stack-based
addressing scheme for passing arguments and saving return
addresses cannot be used in a multi-threading environment
without elaborate support mechanisms or rigid controls on how
it is used. In a multi-thread environment it is not possible to
predict when memory for one set of arguments will be
deallocated with respect to the memory for another set. In the
SAM architecture, allocation of space is made the responsibility
of either the support software or the compiler, because the most
efficient method for doing so depends on the problem being
solved. Efficient implementation of the basic features of the
architecture should allow many different allocation schemes to
be programmed efficiently.
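
The way the prefix register selects an environment can be illustrated with a short Python sketch. The width `SHORT_BITS` and the function name `effective_address` are assumptions made for this example; the text does not fix the number of short-address bits.

```python
SHORT_BITS = 8  # assumed width of a short address; not specified in the text


def effective_address(prefix, short_addr):
    """Form a full address from the prefix register and a short address.

    The prefix supplies the high-order bits and the instruction supplies
    the low-order bits.
    """
    if not 0 <= short_addr < (1 << SHORT_BITS):
        raise ValueError("short address does not fit in SHORT_BITS")
    return (prefix << SHORT_BITS) | short_addr


# Two streams executing the same instruction with different prefix values
# reach different memory locations, that is, different environments.
print(hex(effective_address(0x12, 0x34)))
print(hex(effective_address(0x13, 0x34)))
```

Because only the prefix differs between the two calls, shared code can run unchanged in both streams while each stream touches its own short-address space.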


To illustrate how the prefix register can be used to achieve
dynamic parallelism, consider the code illustrated in Figures 3
and 4. It is assumed that each instruction stream requires an
environment of size $2^k$, and that $2^i$ instruction streams are to be
created. In addition to a number of temporary variables, each
environment contains its starting address, its size, the number
of the current instruction stream, the total number of instruction
streams, and a pointer to the parent environment. There is also
a word initialized to zero, which will be used as a merge target.
Figure 3 illustrates the serial creation of instruction streams,
while Figure 4 illustrates the logarithmic creation of instruction
streams. The logarithmic initiation of instruction streams
operates by creating a single environment of size $2^{k+i}$ and
repeatedly splitting it in half until environments of the proper
size have been created. In the process the proper number of
instruction streams will be initiated.
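
The repeated halving can be modeled with a minimal Python sketch. The function name `split_environments` and its return values are hypothetical, and the exponent names k and i are assumed here to denote the log of the environment size and the log of the stream count. The sketch tracks only environment base addresses and the number of halving rounds, not the instruction streams themselves.

```python
def split_environments(base, size, target):
    """Halve one large environment until every piece has the target size.

    Returns the base addresses of the final environments and the number
    of halving rounds that were needed.
    """
    envs = [(base, size)]
    rounds = 0
    while envs[0][1] > target:
        # Every current environment is split into its two halves; in the
        # architecture each split would also start a new instruction stream.
        envs = [half
                for (b, s) in envs
                for half in ((b, s // 2), (b + s // 2, s // 2))]
        rounds += 1
    return [b for b, _ in envs], rounds


k, i = 4, 3  # environments of 2**k bytes, 2**i streams (example values)
bases, rounds = split_environments(0, 2 ** (k + i), 2 ** k)
print(len(bases), rounds)
```

With these example values the single large environment yields eight environments after three rounds, whereas serial initiation would need one step per stream.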


allocate 2^(k+i) bytes;
for j=1 to 2^i
    init env j and make it current;
    split to shared_code;
endfor
for j=1 to 2^i
    exec merge in env j;
endfor


shared_code:


for j=current_stream to 2^i
    execute merge in environment j;
endfor;


Figure 3. Serial Stream Initiation.
    allocate 2^(k+l) bytes;
    init as one size 2^(k+l) env & make current;
    while (env_size > 2^k)
        env_size = env_size / 2;
        init new env in second half of current env &
            make current;
        split to x;
        restore parent env;
    x:
    endwhile;
    ... Shared code ...
    while (total_streams > 1)
        if (curr_strm_num is even)
            make env at env_adr+env_size current;
        endif;
        exec merge; restore parent env;
        divide curr_strm_num and total_streams by 2;
    endwhile;
    restore parent env;

Figure 4. Logarithmic Stream Initiation.


Creating a separate environment for each stream greatly
increases the overhead for each stream. When multiple
environments are being used, it may be more convenient to treat
the independent streams as individual processes that communicate
through a producer/consumer structure as proposed by several
others [2]. The simplest way to model the producer/consumer
relationship is to follow the producer by the instruction
"split x" and precede the consumer by the instruction
"x: merge k" where k is a variable that has been initialized to
zero. Overruns can be prevented by preceding the consumer by the
instruction "y: merge j" where j is a variable that is
initialized to one, and following the consumer with the
instruction "split y." This scheme will work only if each data
item has a single producer and a single consumer. To support
multiple producers and consumers, it is necessary to introduce
the hardware equivalent of semaphore P and V operations. The P
operation is modeled by the "seq" instruction, which has a single
operand. If the operand is non-zero, the "seq" instruction sets
it to zero and execution continues with the next instruction. If
the operand is zero, both it and the program counter for the
current stream remain unchanged. The seq instruction operates
atomically on its operand, and executes repeatedly until it
succeeds in setting its operand to zero. The detrimental effect
of the busy wait can be minimized by spinning the seq instruction
off into a separate instruction stream. The seq instruction can
be used for multiple-producer multiple-consumer problems and
other types of synchronization.
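
The behaviour of seq described above amounts to an atomic test-and-clear. The Python sketch below models one execution of the instruction; it is a software analogy, not the hardware design. The class name Word, the method names, and in particular the release method (standing in for a V operation, which the text does not specify) are assumptions made for illustration, and a lock stands in for hardware atomicity.

```python
import threading


class Word:
    """A memory word supporting a seq-like atomic test-and-clear."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def seq_once(self):
        """Model one execution of seq on this word.

        Returns True if the operand was non-zero (it is cleared and the
        stream may advance). Returns False if the operand was zero (the
        operand and, conceptually, the PC are left unchanged, so the
        instruction will be executed again).
        """
        with self._lock:
            if self.value != 0:
                self.value = 0
                return True
            return False

    def release(self):
        """Hypothetical V operation: store a non-zero value."""
        with self._lock:
            self.value = 1


if __name__ == "__main__":
    w = Word(1)
    print(w.seq_once())  # first stream passes and clears the word
    print(w.seq_once())  # second stream would retry
    w.release()
    print(w.seq_once())  # passes again after the release
```

A stream that receives False simply re-executes the same instruction, which is the busy wait the text proposes to move into a separate instruction stream.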


5. Implementation Issues 


The SAM architecture will be implemented as a shared pipeline
similar to that found in the HEP multiprocessor [2]. The
number and function of the pipeline stages is not fixed by the
architecture, but for definiteness consider the pipeline pictured
in Figure 5. This is a typical pipeline augmented with two
additional stages to fetch and write stream descriptors. Each
stream descriptor contains the current PC for the stream as well
as the current prefix register value. The descriptor may contain
other items as explained below.


Figure 5. An Augmented Pipeline. 


The descriptor fetch stage of the pipeline obtains descriptors
from many different sources. In particular, each stage of the
pipeline can serve as a source of descriptors, which allows the
pipeline to be fully utilized even when only a small number of
descriptors exist. When descriptors are fetched from the earlier
stages of the pipeline, potential pipeline hazards must be
provided for either in the hardware, or by some software
scheduling technique.


The descriptor write stage of the pipeline provides internal
buffering for descriptors. When the internal buffer of the
descriptor write stage is full and a split instruction creates a new
descriptor, several actions can be taken. The descriptor write
stage can provide storage management for a circular buffer in
some form of backing store, or it can cause a fault to occur
when the number of streams reaches a high-water mark. The
support software has the choice of suspending the execution of
the stream until the number of streams falls below a low-water
mark, or of passing the descriptor to a second processor.


To support the simultaneous execution of several processes,
each of which may have several instruction streams, the SAM
architecture provides features for handling interrupts and faults.
An interrupt is handled by initiating a new instruction stream in
response to an external event. The "return from interrupt" is
accomplished by executing a merge instruction with a zero
argument. An interrupt vector consists of a pointer to the
executable code for handling the interrupt, and a pointer to the
short-address space for the interrupt handler.


Since several program faults of the same type may occur
simultaneously, it is necessary to have some method of
serializing the first portion of the fault handler, which allows
the descriptor of the offending instruction stream to be copied
into the environment of the fault handler without destroying
data that is still needed to process a previous fault of the same
type. This problem is solved by adding a global register for
masking faults, and a status register to the descriptor of each
stream. When a fault occurs and the corresponding fault type
is masked, the descriptor of the offending stream will have a
"suspended" bit set in its status register. A descriptor with the
suspended bit set propagates through the pipeline without
change. When the descriptor reaches the stage where the fault
originally occurred, the stage will schedule the fault handler if
the fault type is now unmasked. A more expensive solution
would be to allow the descriptor write stage of the pipeline to
queue descriptors waiting for the fault to become unmasked.


At times the support software may need to terminate a process
in response to a program fault or other event. Because each
process can have many instruction streams active
simultaneously, some mechanism is needed to identify and
terminate all instruction streams belonging to the terminated
process. To solve this problem a process-id, which can be
used to identify and terminate instruction streams, is added to
the descriptor of each stream. The process-id is copied into the
new descriptor when a split instruction is executed but it may
be changed by the support software. One way to accomplish
this is to combine the assignment of process-ids with memory
management. For example, the process-id could be a pointer to
the segment or page table for the process. The memory
management hardware could be used to force the termination of
instruction streams. Another method is to pass the process-id
of a failed process to the descriptor write stage of the pipeline
and allow this stage to purge all descriptors with matching
process-ids.
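
The second termination method can be sketched in a few lines. The Python model below is illustrative only: the Descriptor fields follow the items the text places in a stream descriptor (PC, prefix register value, process-id), but the field names, the split and purge functions, and the assumption that split also copies the prefix value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    pc: int          # current program counter of the stream
    prefix: int      # current prefix register value
    process_id: int  # identifies the owning process


def split(parent, target_pc):
    """Model of split: the new descriptor inherits the parent's process-id.

    Copying the prefix value as well is an assumption of this sketch.
    """
    return Descriptor(pc=target_pc, prefix=parent.prefix,
                      process_id=parent.process_id)


def purge(buffer, failed_process_id):
    """Drop every buffered descriptor that belongs to the failed process."""
    return [d for d in buffer if d.process_id != failed_process_id]


if __name__ == "__main__":
    root_a = Descriptor(pc=0, prefix=0x100, process_id=1)
    root_b = Descriptor(pc=0, prefix=0x200, process_id=2)
    buffer = [root_a, split(root_a, 40), root_b, split(root_b, 80)]
    print(purge(buffer, failed_process_id=1))
```

Because every stream created by split carries its parent's process-id, a single pass over the buffered descriptors is enough to find all streams of the failed process in this model.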


The addition of status bits to the stream descriptor permits the
implementation of privileged instructions, allows more
conventional compare and conditional jump instructions to be
used, and allows for local masking of faults in the instruction
streams. There are a number of implementation issues that
remain to be solved, but these should be readily addressed as
work on the SAM architecture progresses.


6. Conclusion


The SAM architecture is the first step in developing an 
inexpensive method for supporting data flow style parallelism 
in a von Neumann architecture. The features presented here are 
intended to be inexpensive to implement, and easy to use by a 
compiler. Although the split and merge instructions are simple, 
it has been shown that they can be used to implement arbitrarily 
complex static parallelism in a single environment. Using the 
prefix register to create multiple environments, it is possible to 
implement arbitrarily complex dynamic parallelism at a cost 
somewhat higher than that for static parallelism. 
Implementation issues have been discussed that allow for 
multiple processes as well as interrupt and fault handling. It is 
hoped that the SAM architecture will soon lead to the 
development of one or more small inexpensive 
multiprocessors. 


Dynamic Structured Data Flow:
Preserving the Advantages of Sequential Processing
in a Data Driven Environment

Tel Aviv, Israel


Abstract. An architectural model is presented which enjoys the
automatic sequencing of parallel operations characteristic of
dataflow. However, the processors employed incorporate program
counters and execute dependent sequences of actors in a
sequential, fetch/execute, Von Neumann fashion. The synthesis
of these ordinarily opposing approaches is achieved without
sacrificing the fine grained parallelism of classical dataflow. As a
theoretical model, the machine is shown to achieve uninterrupted
sequential execution of all critical paths in an arbitrary algorithm,
subject to a single class of system overhead: context initiation.


Introduction 


Dataflow computing systems have generally been motivated by 
the need to get away from an underlying machine model which 
is inherently sequential, to one which naturally supports 
parallelism. Proceeding from this fundamental perspective, 
proponents of these systems argue that we require a machine 
which relates independently to each elementary computing 
activity, scheduling them for execution subject only to 
dependence constraints inherent in the algorithm. The notion 
of independently scheduling each activity however, does not 
properly follow from the first premise. A cursory look at 
almost any computation graph shows that there are paths of 
computation activity of significant length; i.e. groups of 
activities that are inherently sequential. By saying that these 
sequentiality constraints are imposed by the algorithm we do 
not change the fact that they would be much more efficiently 
executed by a sequential processor than by one-at-a-time 
scheduling of each instruction. These and other efficiency 
problems of classical dataflow have been documented by Kuck 
and others (see for example [5]). 


Indeed no processor can be more than a sequential machine;
even the execution units in dataflow systems can only do one
thing at a time. The goal of parallel architectures is to have
many such units working together. This goal should not
obscure the fact that an individual processor can perform
significantly better by being optimized for sequential
processing. Two decades of engineering experience with
uniprocessor CPUs has taught us how to incorporate
instruction caching, lookahead, pipelining etc.; forcing an
execution unit to work on elementary tasks that are
independently scheduled is retrogressive. Further, a typical
dataflow machine consists of a long pipeline of independent
units operating asynchronously to each other. Apart from the
length of the pipe itself, the full handshake protocols required
at each unit-to-unit interface exact a significant price in
performance compared with synchronous systems.


In [3] a basic deficiency of the Von Neumann architecture was
eloquently expressed: A single active processor is controlling a
passive memory state. Elaborating, we may say that since the
contents of memory represent the algorithm, the Von Neumann
machine has the algorithm as the passive agent, with an
additional artificial level of control being imposed upon it by
the single CPU. We require the opposite arrangement: the
algorithm should dictate control of the system resources, with
processor power being merely one such resource. However,
as noted above, the algorithm itself dictates sequential
execution along paths in the computation graph (perhaps many
such paths in parallel), as often or more often than not.


For algorithms which have static structure, i.e. no branches or 
loop variables that are determined dynamically, we can partition 
the computation graph into a set of sequential paths that 
communicate with each other (see e.g. [8]). Paths are statically 
allocated to different processors. If an operand being
communicated from path A to a node b on path B does not
arrive in time for consumption by the processor executing B,
the latter is suspended via a form of exception handling. The
path is restarted when the operand arrives. The partition 
chosen assumes a worst case time for each elementary 
operation and tries to minimize the number of such "operand 
late" exceptions. A method for finding an optimal partition is 
given in [7]. The Hughes Data Flow Machine [4] and other 
projects have incorporated similar techniques in their designs. 


This approach is workable for static programs. In dynamic 
code however, it is clearly not possible to allocate paths 
beforehand; we have no idea -- before the program actually 
runs -- how the partition should look. In this paper we 
propose a parallel architecture in which any processor may be 
conscripted to execute dynamically determined segments of 
sequential code. Thus the unit which we schedule for 
execution is an execution path rather than a single elementary 
activity as in classical dataflow. On the other hand we wish to 
realize Backus’ directive for an architecture in which the 
algorithm is the active agent allocating passive resources. 
Accordingly, execution paths will be dynamically scheduled in 
accordance with the arrival of operands; no central control is 
imposed by a CPU. In particular our approach does not dictate 
the level of granularity used; parallelism at the lowest level may 
be exploited. We have dubbed this abstract machine Dynamic 
Structured Data Flow (DSDF). 


In what follows we shall assume that a name is associated with 
each token generated by a computation, after the manner of e.g. 
the U interpreter [1]. Tokens are matched in a matching store 
to form executable operand pairs. This approach elegantly 
solves the problem of different contexts and also provides a 
convenient conceptual separation of instructions and data. 
Other reasons for this choice are elaborated in the sequel. 


An additional assumption which we shall make in the DSDF 
architecture is that all processors have local access to all 
program code. The issue of implementation will be treated in a 
later section of this paper. 


The Execution Discipline 


In what follows, we assume an algorithm is represented by a 
directed graph G, in which arcs represent the flow of data and 
nodes correspond to activities. A weight may be associated 
with each node corresponding to its execution time. 


Definition 1: A sequential instruction path (SP) is a subset of
the activities in G which are linearly ordered with respect to
dependence and where, for activities a and b in the SP, if a < b
then there is no c not in the subset such that a < c < b.
Equivalently, an SP is a path in G.


Definition 2: An execution process (EP) will denote an instance
of execution of an SP. An EP is created by associating it
with a particular instruction, not with a general SP. Since a
single instruction is also an SP, the resultant process created
is in fact an EP consistent with this definition. However, an
EP will, in general, progress with its activity in such a way
as to execute an arbitrary SP. All activity of an EP is a
coherent unit: execution is not data driven but rather proceeds
according to sequence. If an operand required for the
execution of some node along the SP has not arrived, the
result is an EP disabled exception. EPs progress according
to the EP execution rule given below.


Definition 3: A node which represents a unary instruction, or
for which one operand has been made available, is called EP
enabled.


EP execution rule: If a node v, with out-degree > 0, has been
executed by an EP α then:

1) if the out-degree of v = 1, and its successor v’ is EP
enabled, α proceeds with the execution of v’.

2) if the out-degree of v is > 1, α proceeds with an
arbitrary successor v’ which is EP enabled.

In all cases, α COMmunicates its result to any successors
with which it does not continue. An EP terminates at a node
v when either a) v has out-degree 0, or b) none of its
successors are EP enabled.


If an operand is COM'd to a node which is EP enabled, a 
context initiation (CI) results. The CI corresponds to the 
creation of a new EP. 


The net effect is that an EP is created to execute an SP of 
unknown extent. Moreover, even specifying an EP by initial 
node and length does not uniquely determine the SP which will 
be executed; more than one path may be possible. 
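
The execution rule can be exercised on a small graph. The following Python sketch is a simplified, untimed model written for illustration: it runs EPs one after another rather than in parallel, always picks the first EP-enabled successor in list order (the rule allows any), and assumes each node has at most two operands arriving from distinct predecessors. The function name run_dsdf and the graph encoding are hypothetical.

```python
from collections import deque


def run_dsdf(succ):
    """Partition an acyclic graph into the paths executed by EPs.

    `succ` maps each node to the list of its successors.
    Returns the list of paths, one per EP, in order of creation.
    """
    indeg = {v: 0 for v in succ}
    for v in succ:
        for s in succ[v]:
            indeg[s] += 1
    arrived = {v: 0 for v in succ}   # operands already waiting at each node
    pending = deque(v for v in succ if indeg[v] == 0)  # one EP per source node
    paths = []
    while pending:
        v = pending.popleft()
        path = [v]
        while True:
            nxt = None
            for s in succ[v]:
                # EP enabled: unary node, or the other operand is waiting.
                enabled = indeg[s] <= 1 or arrived[s] == indeg[s] - 1
                if enabled and nxt is None:
                    nxt = s             # the EP continues in place
                elif enabled:
                    pending.append(s)   # COM to an enabled node: context initiation
                else:
                    arrived[s] += 1     # COM to a waiting node: operand is stored
            if nxt is None:
                break                   # no enabled successor: the EP terminates
            path.append(nxt)
            v = nxt
        paths.append(path)
    return paths


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
    print(run_dsdf(graph))
```

In this example the EP started at b continues with c and causes a context initiation at d, so the graph is covered by three paths. A different arrival order or a different choice among enabled successors would give a different cover, which is the non-uniqueness noted above.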


To fix the above ideas more concretely let us consider the
creation, execution and termination of a particular EP α in a
DSDF machine. All operands will be deposited in a central
associative memory, with their tags as a search key. We
associate with each EP a Home (H) datum; this is the datum
kept by the processor in its internal register for the duration of
an EP execution. It may be thought of as the context for the
EP. A processor executing a binary operator node w uses its H
datum as one operand and the tag of its H datum to address the
token memory to retrieve the second operand. However, the
value of the token is also sent in the memory access. If the
match fails, the access is interpreted as a write; the value and its
tag are installed in the memory, to await the arrival of a match.
If the match succeeds, the access is interpreted as a read; the
operand is returned to the processor. Thus the COM operation
in the execution rule is no different in practice from an ordinary
access to token memory. We associate the notion of
communication with a token when the arc being traversed is not
on the execution path of any EP. This is the only case of
inter-EP communication in the DSDF machine. Returning to
the specific example, some EP β created α when it executed a
node v which had more than one successor that was EP
enabled. That is, an operand was COM'd by β from v to, say,
v’, resulting in a CI. In terms of our architecture, the instruction
v specified two successors, i.e. two context tags. The
processor executing v successfully retrieved the second
operand from the matching store for one of its successors and
continued execution with that path. For the other successor v’,
it sent the tag and value to the store with an indication that it is
busy. The matching store controller detects the condition
"match-with-processor-busy" and initiates a new EP α. The
result of executing v’ becomes the H datum for α; the H datum
undergoes transformation with each node executed by α and
provides the context with which matching operands are
retrieved for each subsequent instruction. α terminates when
all attempts to retrieve a matching operand fail, or when an
instruction specifies no successor.
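
The combined read/write access to the token memory described above can be modelled with a dictionary keyed by tag. This Python sketch is a minimal illustration under simplifying assumptions: tags are hashable values, each tag pairs exactly two operands, and None is reserved to signal a failed match. The class and method names are hypothetical and do not describe the actual matching store controller.

```python
class MatchingStore:
    """Associative token memory keyed by tag (a minimal model)."""

    def __init__(self):
        self._waiting = {}  # tag -> value awaiting its partner

    def access(self, tag, value):
        """Present (tag, value) to the store.

        If no token with this tag is waiting, the access acts as a
        write: the value is stored and None is returned. If a token is
        waiting, the access acts as a read: the stored operand is
        removed and returned.
        """
        if tag in self._waiting:
            return self._waiting.pop(tag)
        self._waiting[tag] = value
        return None


if __name__ == "__main__":
    store = MatchingStore()
    print(store.access("ctx7", 3))  # no partner yet: stored, EP would terminate
    print(store.access("ctx7", 4))  # partner found: returns 3, EP continues
```

In this model an EP that receives None has in effect performed the COM operation and terminates, while an EP that receives an operand continues with its Home datum and the returned value.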


Decision and loop constructs are handled by the usual branch 
type instructions rather than the elaborate switches of dataflow. 
This is possible because SPs are sequential sequences of 
instructions meant to be fetched/executed by a processor which 
incorporates the usual program counter. The arrangement for 
an If-Then-Else construct incorporating a single SP is shown in 
figure 1(a). BC denotes a Branch on Condition. 


If more than one SP is involved in an If-Then-Else, we can 
provide a compound SP as a program structuring aid. This is 
a group of static paths that communicate with one another, and 
which have a distinguished set of inputs and outputs. The 
If-Then-Else is then constructed from 3 compound SPs, as 
shown in figure 2. SP1 and SP2 are the alternate code 
sequences to be chosen; they must match in their numbers of 
inputs and outputs. The third SP is a control token generator;
it takes some subset of the inputs to SP1/SP2 and outputs
the single boolean token which controls the IF construct.


The BC instruction is shown here accepting the boolean control 
token as a second operand. This operand is in effect 
communicated to it by an SP within the control token 
generator. An important issue is how to efficiently implement 
the distributor. We shall return to this problem in a later 
section. 


Figure 1(b) illustrates looping in the DSDF machine. The usual 
tag operators for a dynamic machine are employed, allowing 
many instances of the Loop to execute concurrently. The loop 
body may be extended to a compound SP in a manner 
analogous to the IF construct; tag operators are added for each 
sequential path as required. 


We now proceed to characterize the performance advantage 
accruing to the DSDF execution discipline. 


Definition 4: Let v’, a successor of v, be a node which
experiences a context initiation (CI). At the time of the CI, v’
must have been EP enabled. If v’ is a binary operation, let
the node which provided its first operand be other than v. If
the EP executing v continued with some other successor,
say v”, the CI experienced by v’ is called inherent.


Since v continued with some other successor v”, it follows by
the execution rule that v” could not have been executed by an
EP arriving via another predecessor, say w, as follows: suppose
v” has two predecessors, v and w, and hence is a binary
operation. For the EP executing w to have continued with v”, it
would have had to find it EP enabled by v, in which case the EP
executing v could not have continued with it. Hence the EP
executing v was the only one that could have continued with
either of the nodes v’ or v”. Stated differently,


Lemma 1: An inherent CI can be avoided only at the cost of
another CI.


Hence the name -- inherent. Note that the question of whether 
a CI is inherent or not is dynamically determined, as illustrated 
in figure 3. If b arrives at c before d, there will be no inherent 
CI, while if d arrives first, the EP executing b will be the only 
one able to continue with either a or c, hence an inherent CI 
will occur at one of them. 


For a computation graph G, let G_u(E) be the graph derived
from G by "unravelling" all loops, specifying all branch
decisions and ascertaining the actual execution time of every
node, in accordance with some particular execution instance E
of G. G_u(E) is a weighted, directed, acyclic graph. If E
executed on a DSDF machine, G_u(E) will have been partitioned
into a set of SPs which communicate with one another. In
particular, the EPs generated by the DSDF execution constitute
a path cover for G_u(E) in which arcs in the cover correspond to
transformations in place on the Home datum of a processor,
and other arcs correspond to communication between EPs. We
would expect that an EP may be held up by late arrival of
operands which must be communicated by other EPs. Such
an EP will have to be suspended and, in effect, broken into two
EPs, each of which executes without interruption. Some
optimal path cover exists for G_u(E) which allows maximal
uninterrupted sequential processing. That is, if a processor is
allocated to each of the paths in the cover, the number of
uninterrupted sequences executed will be minimal, or
equivalently, the average length of uninterrupted sequences will
be maximal. This is similar to what is done in [7] for static
programs; in our case however, G_u(E) does not exist until
execution completes and hence the optimal cover is undefined a
priori.


Lemma 2: Let G_u(E) represent an instance of execution of G.
If E was executed by a DSDF machine, then the following
hold:

1) all nodes in G_u(E) will have executed,

2) all EPs generated in the course of E progress without
interruption, and

3) if an EP terminates at a node v then either the
out-degree of v is 0 or any CIs experienced by
successors of v must be inherent.


Proof: A node v’ fails to be executed by the same EP which
executed its predecessor v in one of 2 cases:

a) v’ is not EP enabled. This implies some other
predecessor w, which must supply a second operand to
v’. Hence when the EP executing w checks v’, it will
find it EP enabled. Execution will continue with v’, or
with some other successor v”, which is also EP
enabled. For the latter, case (b) below applies.

b) The EP executing v continued with some other
successor. The rule specifies that a CI occurs, a new
EP is created and hence v’ is executed.

The DSDF machine begins by initiating an EP for each node
in G with in-degree 0. By induction and the application of (a)
and (b) above, we have that all nodes in G_u(E) are executed.

Assertion (2) follows directly from the execution rule: there
are no provisions for suspension/resumption; an EP can only
progress or terminate.

Further, a CI at a node v’ occurs only if the node is EP
enabled and subsequently has an operand COM'd to it from
some predecessor v. By the execution rule, there must have
been some other successor v” of v, also EP enabled, to
which the EP executing v continued. The CI occurring at v’
is therefore inherent, whence all CIs which occur in the
course of E must be inherent. ∎


Our basic result follows directly from the Lemma. We say that
a path p in G_u(E) is a critical path if w(p) ≥ w(q) for all paths
q, where w(p) denotes the sum of the weights of the nodes on
p. As before, weights correspond to execution times.
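
To make the definition concrete, a critical path of a node-weighted acyclic graph can be found by a longest-path recursion. The Python sketch below is illustrative only; the function name, the graph encoding, and the example weights are assumptions, and when several paths share the maximum weight it returns just one of them.

```python
from functools import lru_cache


def critical_path(succ, weight):
    """Return (w(p), p) for a maximum-weight path p in an acyclic graph.

    `succ` maps each node to its successors; `weight` maps each node to
    its execution time. w(p) is the sum of the node weights along p.
    """
    @lru_cache(maxsize=None)
    def best_from(v):
        # Heaviest path that starts at v, as (weight, tuple of nodes).
        tails = [best_from(s) for s in succ[v]]
        tail_w, tail_p = max(tails, default=(0, ()))
        return weight[v] + tail_w, (v,) + tail_p

    return max(best_from(v) for v in succ)


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
    times = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2}
    print(critical_path(graph, times))
```

For the example graph the path b, c, e has weight 7, which is at least the weight of every other path, so it is critical in the sense defined above.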


Theorem 1: All critical paths in G_u(E) are executed in an
uninterrupted sequential manner except for the system
overhead associated with CIs. Further, only inherent CIs are
experienced by a critical path.

Proof: Let p be a critical path. If p is executed by a single EP,
then by Lemma 2(2) it cannot be interrupted. Otherwise, let v be
the node at which the first EP executing p terminated. If the
successor v’ of v on p is not EP enabled, there must be some
other predecessor w of v’ on a path q, which supplies the
other operand to v’, and which has not yet done so. Then p
cannot be a critical path to v’ because w(q) > w(p) up to the
node v’ and hence p cannot be a critical path in G_u(E).

Thus if the EP executing p, a critical path, terminated at v
and sent an operand to v’, the latter must have been EP
enabled. By the execution rule a CI immediately results and
an EP is created which continues with v’. By Lemma 2(3)
this CI must be inherent. ∎


It is important to note that we have used the notion of a critical
path in a somewhat loose fashion. In particular, it is critical
paths in G_u(E) that we are treating in Theorem 1. G_u(E)
represents execution on a DSDF machine, and CI times are
treated as part of the weight value of nodes which experienced
them. In terms of the execution times of elementary operations
alone, it may be that w(q) > w(p) holds for some instance of
execution of G, while if CI times are added to the weights, the
inequality is reversed. However, only inherent CIs will occur
regardless of the amount of time they add to a path's execution.
Further, it is clear that as the time overhead of a CI approaches
zero, the critical paths of Theorem 1 become determined only
by the inherent costs of the operations of the algorithm.


In sum, the EPs generated by a DSDF computation never wait
if they are on a critical path. Further, Theorem 1
demonstrates that the theoretical limit of parallel speedup can be
approached for any algorithm to the extent that we can reduce the
CI time.


Since processors in DSDF are active and access the token
memory in much the same way as a Von Neumann CPU, many
of the pipeline stages of a typical dataflow machine, e.g. packet
formation, packet arbitration etc., are eliminated. Tighter
coupling and synchronous communication protocols between
processor and memory should be possible.


Conclusion 




Theorem 1 highlights the central role played by CIs in system
performance. If we bring specialized architectural resources to 
bear on this particular function, we should expect significantly 
increased performance. In a related paper [6], a new process 
spawn technique is described, which is designed to minimize 
system overhead in the DSDF environment. The method 
provides a standard interface from algorithm to processors 
which is used in a uniform fashion to support all basic code 
constructs which generate multiple processes. The ‘distributor’ 
node discussed earlier is easily treated as a special case. 


[Figure 1. (a) If-Then-Else construct incorporating a single SP,
using a Branch on Condition (BC); (b) loop construct.]

A basic difficulty to be overcome is that of contention by
multiple processors for the matching store. The usual solution
to memory contention (distributed access via interleaving of
modules) is not directly applicable to an associative store
because a central controller must match the tag against all
locations. In the Irvine machine model for example [1,2], each
processor node has its own matching unit. Code is statically
distributed to the different processors and no runtime variation
on this partitioning is permitted. The same type of static criteria
are used to decide which matching store in the network or ring
is to receive result tokens generated by the program. This
approach however, is unlikely to permit realization of the kind
of performance potential described in this paper for DSDF:
uninterrupted sequential execution of critical paths. The
assumption that code and data are allocated statically
necessarily implies that processor resources are not, in general,
available for any executable activity. Only a matching store that
is accessible by all processors would allow this generality of
resource distribution. This issue, among others, is being
explored by a research group at Bar Ilan University which is
developing an architecture and programming system based on
DSDF principles.


References


[1] Arvind and Gostelow, K.P., "The U-Interpreter," IEEE
Computer, Vol. 15, No. 2, February 1982.

[2] Arvind, Kathail, V., and Pingali, K., "A Dataflow Architecture
with Tagged Tokens," Rep. LCS/TM-174, MIT, September
1980.

[3] J. Backus, "Can Programming be Liberated from the Von
Neumann Style? A Functional Style and its Algebra of
Programs," Communications of the ACM, Vol. 21, No. 8,
August 1978, pp. 613-641.

[4] M.L. Campbell, "Static Allocation for a Data Flow
Multiprocessor," Int'l Conference on Computer Architecture,
1985.

[5] D.D. Gajski, D.A. Padua, D.J. Kuck and R.H. Kuhn, "A Second
Opinion on Dataflow Machines and Languages," Computer,
Feb. 82, pp. 58-70.

[6] Gottlieb, I., "Efficient Process Spawning in Functional
Multiprocessor Environments," Technical Report
TR-CS102PA, Dept. of Computer Science, Bar Ilan University,
January 1988.

[7] Gottlieb, I., "The Partitioning of QGDF Computation Graphs," to
appear in Distributed Computing, 1988.

[8] Gottlieb, I., "SDF-The Structured Dataflow Model of Computing
and its Architecture," First Int'l Conference on
Supercomputing, Dec. 1985.


[Figure 2. If-Then-Else constructed from compound SPs: Static
Path 1, Static Path 2, and a control token generator.]

[Figure 3. Graph on nodes a, b, c, d in which the occurrence of
an inherent CI is determined dynamically.]


ITERATIVE ALGORITHMS IN A DATA-DRIVEN ENVIRONMENT* 


Paraskevas Evripidou and Jean-Luc Gaudiot 
Computer Research Institute 
Department of Electrical Engineering-Systems 
University of Southern California 


Los Angeles, California 
(213) 743-0249 


Abstract: Data-flow principles of execution are an
elegant way to synchronize many parallel processes in a
large scale multiprocessor system. However, the execution
by runtime detection of data dependencies also introduces
many inefficiencies. In this paper, we apply the data-flow
principles to a numerically intensive application: the
Jacobi method for solving linear systems. We introduce a
modification to the algorithm which allows a full
exploitation of the parallelism inherent in the method by
"vectorizing" a portion of the calculation and allowing some
amount of "look-ahead" in the termination criterion. Resource
allocation issues are then considered and we demonstrate by
a combination of analytical and simulation methods a
priority mechanism which allows both an increase in
performance and better resource utilization.


1 Introduction 


The computing needs of the near future are far beyond the 
power of any supercomputer available today. Physical con- 
straints are placing an upper bound on the speed of single 
processors. Current technology is rapidly approaching this 
limit. A natural solution consists of having many proces- 
sors collaborating to solve large problems. However the ba- 
sic principles of von Neumann architectures preclude their 
extension to parallel execution environments [1]. Data- 
flow principles of execution on the other hand, offer easy 
programmability and tolerance to high memory latencies 
which are inevitable in large scale multiprocessors [2]. Iterative
algorithms are very powerful tools for solving linear
systems, and are particularly efficient in the solution of 
large sparse systems. These sparse systems are frequently 
encountered in the solution of Partial Differential Equa- 
tions. 

The data-flow model of execution [3] represents programs 
as graphs. The nodes (actors) of a data-flow graph are the 
instructions of the program. Tokens flow along the arcs 
carrying data from the producer actors to consumer ac- 
tors. The static model of execution has been described 
by Dennis [4]. This model of execution allows only one 
instantiation of each actor at any given time. In the “dynamic”
model (Arvind et al.) [2], multiple instantiations
of the same actor are allowed. This is done by associating
a different color (tagging) with the tokens which belong to
different instances of the same actor. The rules for tagging
the tokens are referred to as the “U-Interpreter”.

*This material is based upon work supported in part by the U.S.
Department of Energy under Grant No. DE-FG03-87ER25043.

Dynamic data-flow provides an efficient way of exploring 
the parallelism present in an algorithm. It has been shown 
[2] that the U-Interpreter principles are particularly effi- 
cient in conjunction with the FORALL type of constructs. 
This is the same type of construct which the von Neumann 
model of execution targets for optimization through vector- 
ization. However, iterative algorithms have been tradition- 
ally implemented by using REPEAT-UNTIL and WHILE 
constructs. The U-interpreter cannot unravel these loops. 
These REPEAT-UNTIL loops can become more efficient
for parallel execution if a FORALL (For i = 1,n) is inserted
into their body. This allows the U-Interpreter to
simultaneously unravel n iterations instead of merely one.
This paper examines the behavior of iterative algorithms in
a dynamic data-driven environment and the enhancement
in performance provided by the REPEAT-UNTIL transformation.
We analyze our proposed scheme with a deterministic
simulator of a dynamic data-flow architecture.

The goal of this paper is thus to study the performance 
of a dynamic data-flow architecture applied to a numeri- 
cally intensive problem. A modification of the scheme is 
then introduced and a new execution priority mechanism 
is analyzed. In Section 2, we review essentials of the Jacobi 
method for solving linear systems. We also briefly discuss 
some data-flow principles relevant to the implementation of 
iterative algorithms. In Section 3, our transformation tech- 
nique is introduced and simulation results are provided. 
The priority mechanism is presented and analyzed in Sec- 
tion 4, while concluding remarks are made in Section 5. 


2 Iterative Algorithms in a Data-Driven 
Environment 


Iterative techniques are very frequently used for the solution
of systems of equations. An iterative technique to solve
an n×n linear system Ax=b starts with an initial approximation
$x^{(0)}$ to the solution x, and generates a sequence of
vectors $\{x^{(k)}\}_{k=0}^{\infty}$ which can be shown to converge to x.
The Jacobi method for solving linear systems is shown by
equation 1.

$$x_i^{(k)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j=1,\, j \neq i}^{n} a_{ij}\, x_j^{(k-1)} \Big), \quad i = 1, \dots, n \qquad (1)$$

If A is strictly diagonally dominant, then for any choice
of $x^{(0)}$, the Jacobi method gives a sequence $\{x^{(k)}\}_{k=0}^{\infty}$ that
converges to the solution of Ax=b.
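
As an illustration only, the Jacobi update can be sketched in a few lines of Python. The function names and the 2 x 2 test system are assumptions made for this sketch, not taken from the paper. Each component of the new vector depends only on the previous vector, which is what makes the components of one iteration independent of each other.

```python
def jacobi_sweep(A, b, x_prev):
    """One Jacobi iteration: build the new approximation from the previous one."""
    n = len(b)
    x_new = []
    for i in range(n):
        # Off-diagonal contribution sum_{j != i} a_ij * x_j from the previous vector.
        off_diag = sum(A[i][j] * x_prev[j] for j in range(n) if j != i)
        x_new.append((b[i] - off_diag) / A[i][i])
    return x_new


def is_strictly_diagonally_dominant(A):
    """True if |a_ii| > sum_{j != i} |a_ij| holds for every row i."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(A)
    )


if __name__ == "__main__":
    # Hypothetical test system whose exact solution is x = [1, 2].
    A = [[4.0, 1.0], [2.0, 5.0]]
    b = [6.0, 12.0]
    x = [0.0, 0.0]
    for _ in range(60):
        x = jacobi_sweep(A, b, x)
    print(is_strictly_diagonally_dominant(A), x)
```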


Graph construction 


The simulation model follows in principle the U-interpreter
model of execution. The architectural model is a 64-processor
hypercube, based on the MIT Tagged Token Dataflow
Architecture [2]. Each operation (match, fetch, etc.) takes
one time unit. A delay of one time unit per communication
hop is also assumed.

Loop index generation and the treatment of conditionals
are the dominant part of data-flow graphs for iterative
algorithms. The lack of global state and the single assignment
principle make data-flow graphs (programs)
fundamentally different from conventional programs. The
handling of loop indices receives different treatment in
a data-flow environment. Consider the following nested
loops:

For i in 1,k cross j in 1,l

In a von Neumann environment, the variable i will be updated
k times while j will be updated k × l times. However,
in a data-flow environment there is no notion of a variable;
therefore the value i has to be created k × l times. The same
holds for j. In other words, all the indices of the outer loops
have to be created as many times as the index of the innermost
loop. Another notable characteristic of the data-flow
environment is the treatment of conditionals. Each input
value of the true block of a conditional has to be gated
through a true gate (T). In a similar fashion, each input
value to the false block has to be gated through a false
gate (F). This means that the tokens carrying the conditional
must reach all the gates involved and all gates will
fire. Loop indices and conditionals introduce a lot of
“communication/synchronization” overhead in data-flow graphs
of iterative algorithms. The transformation technique presented
in the next section targets this overhead for a more
efficient execution.
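
The index-generation cost described above can be made concrete with a small counting sketch. It is a toy model written for this illustration (the function names are assumptions), not a data-flow simulator: the sequential version counts writes to the loop variables, while the data-flow version counts the index values that must exist as separate tokens, one pair per instance of the inner loop body.

```python
def von_neumann_index_updates(k, l):
    """Count writes to the loop variables i and j in a sequential nested loop."""
    i_updates = 0
    j_updates = 0
    for _i in range(1, k + 1):
        i_updates += 1          # i is written once per outer iteration
        for _j in range(1, l + 1):
            j_updates += 1      # j is written once per inner iteration
    return i_updates, j_updates


def dataflow_index_tokens(k, l):
    """Count index tokens when each inner-body instance needs its own (i, j)."""
    tokens = [(i, j) for i in range(1, k + 1) for j in range(1, l + 1)]
    i_tokens = len([i for i, _ in tokens])
    j_tokens = len([j for _, j in tokens])
    return i_tokens, j_tokens


if __name__ == "__main__":
    print(von_neumann_index_updates(3, 4))
    print(dataflow_index_tokens(3, 4))
```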


3 Transformation Techniques 


In a data-flow environment all the parallelism present in
an algorithm is inherently preserved. Nevertheless, some
changes in the implementation of algorithms can help take
full advantage of the potential of the data-flow principles.
In the remainder of this Section, our transformation tech- 
nique for improving the performance of iterative algorithms 
in a dynamic data-flow machine is described. 


3.1 Transformation algorithm 


Iterative algorithms have traditionally been implemented
in a step-at-a-time approach. This was very natural in
the pre-computer era since scientists and mathematicians
would typically undertake the procedure manually. The
same approach is very natural in conventional von Neumann
architectures, because the (single) human brain is
replaced by a single powerful processor. In von Neumann
architectures, iterations are handled by REPEAT-UNTIL
and WHILE constructs. The stopping criterion is naturally
checked at each iteration. These REPEAT-UNTIL
constructs severely limit the performance of parallel processors
because they cannot be vectorized and/or multitasked.
In addition, the performance of parallel processors
is restricted by the fact that the stopping criterion is calculated
at each iteration, which usually involves a lot of synchronization
overhead. This synchronization overhead is
sequential in nature, which has been shown (Amdahl’s law)
to have a very negative effect on the maximum achievable
speedup.

For the great majority of iterative algorithms, we can
theoretically estimate the order O(n) of the number of it- 
erations needed. When a computation is expected to take 
100 iterations, for example, it does not serve any purpose 
to test the stopping criterion during the early iterations. 


3.1.1 Basic Principles 


Reduction of the overhead is possible by inserting a
FORALL loop ([For i = 1,n] loop) inside a REPEAT-UNTIL
construct. This allows us to check the stopping
criterion every n iterations. In addition to the reduction of
overhead due to the decrease in the number of evaluations
of the stopping criterion, we can execute some parts of the
various iterations in parallel.

The basic form of the modified Jacobi implementation
is shown in Figure 1a. Figure 1b shows the traditional
implementation of the Jacobi algorithm.


n = expected_number_of_iterations(...)
REPEAT
    For i=1,n do Jacobi(...)
    check_stopping_criterion(...)
    n = evaluate_n(...)
UNTIL norm_of_error < tol


Figure 1a. The modified Jacobi Implementation.


REPEAT
    Jacobi(...)
    check_stopping_criterion(...)
UNTIL norm_of_error < tol


Figure 1b. Traditional Jacobi Implementation.

The function expected_number_of_iterations() is 
used to give an initial estimate of the number of iterations 
needed. The decision will be based on the nature of the 
problem and the convergence rate of the algorithm. The 
function evaluate_n() estimates the number of iterations 
needed to achieve the required accuracy. 
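
A sequential Python sketch of the modified loop may help fix ideas. It is not the data-flow program graph used in the paper: the helper names, the max norm, the `max_sweeps` guard and the constant `evaluate_n` used in the demo are all assumptions of this sketch. The point it illustrates is only the control structure, in which the stopping criterion is evaluated once per block of n sweeps rather than after every sweep.

```python
def jacobi_sweep(A, b, x_prev):
    """One Jacobi iteration computed from the previous approximation."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x_prev[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def max_norm_distance(x, y):
    """Distance between two approximations (max norm, chosen for the sketch)."""
    return max(abs(u - v) for u, v in zip(x, y))


def modified_jacobi(A, b, x0, tol, n_initial, evaluate_n, max_sweeps=10_000):
    """REPEAT { n sweeps; check; n = evaluate_n } UNTIL error < tol."""
    x_prev, x = list(x0), list(x0)
    n = max(1, n_initial)
    sweeps = 0
    checks = 0
    while sweeps < max_sweeps:
        for _ in range(n):                      # For i=1,n do Jacobi(...)
            x_prev, x = x, jacobi_sweep(A, b, x)
            sweeps += 1
        error = max_norm_distance(x, x_prev)    # check_stopping_criterion(...)
        checks += 1
        if error < tol:                         # UNTIL norm_of_error < tol
            return x, sweeps, checks
        n = max(1, evaluate_n(error, tol))      # n = evaluate_n(...)
    raise RuntimeError("no convergence within max_sweeps")


if __name__ == "__main__":
    # Hypothetical system with exact solution [1, 2]; evaluate_n is a stub
    # that always asks for 5 more sweeps.
    A = [[4.0, 1.0], [2.0, 5.0]]
    b = [6.0, 12.0]
    print(modified_jacobi(A, b, [0.0, 0.0], 1e-10, 10, lambda err, tol: 5))
```

With n fixed at 1 the sketch degenerates to the traditional loop of Figure 1b, which checks the criterion after every sweep.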

Unravelling the FORALL loops yields considerable potential
for parallel execution. However, it should be noted
that almost 70% of a typical iterative program, coded using
the U-Interpreter principles, is synchronization overhead
related to the interpreter itself.

The “overhead/synchronization” actors are the target of
our proposed scheme. In an iterative algorithm, the current
iteration depends on all or part of the previous iteration.
This means that in a data-flow environment, detection
of data dependencies remains at the level of instructions,
thereby allowing maximum pipelining among the iterations.
For example, as soon as $x_i^{(k)}$ has been calculated,
the next iteration k + 1 can be initiated without awaiting
the whole production of the vector $x^{(k)}$.


3.1.2 Estimating the number of iterations


The implementation of the evaluate_n() function is application
dependent. For our experiments we used a function
based on the observed convergence rate of the algorithm.
Testing the stopping criterion consists of first calculating
the distance $d_n = \|x^{(n)} - x^{(n-1)}\|$ between the
n-th approximation and the previous approximation. If this
distance $d_n$ is less than the desired value tol, the execution
terminates. Otherwise, it proceeds to the next iteration.
The reduction coefficient

$$R_n = \frac{\|x^{(n-1)} - x^{(n-2)}\|}{\|x^{(n)} - x^{(n-1)}\|} = \frac{d_{n-1}}{d_n} \qquad (2)$$

indicates how many times the distance $d_n$ at iteration n
has been reduced w.r.t. the distance at iteration n − 1.

Sometimes, a converging iterative process oscillates during
the first few steps. To compensate for this phenomenon,
the reduction coefficient $R_n$ can be estimated by averaging
the effect of t iterations:


(3)


Assuming that the reduction coefficient is uniform throughout
the iterative process, we should expect that

$$d_{n+k} = \frac{d_n}{R_n^{\,k}} \le tol \qquad (4)$$

and finally

$$k = \frac{\log(d_n / tol)}{\log R_n}, \qquad \text{new } n = \lceil k \rceil \qquad (5)$$

Equations 2 to 5 form the basic structure of the function
evaluate_n(...). Analytical proof for this function is beyond
the scope of this paper.
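
One possible reading of this estimate, given here as a hedged sketch rather than the authors' implementation, assumes that the distance between successive approximations shrinks by a constant factor R per iteration. The number of further iterations k is then the smallest integer for which the current distance divided by R^k no longer exceeds tol. The signature, the `fallback` parameter and the handling of a non-converging step are assumptions of the sketch, and the averaging over several iterations is omitted.

```python
import math


def evaluate_n(d_prev, d_curr, tol, fallback=1):
    """Estimate how many more iterations are needed to reach tol.

    d_prev and d_curr are the distances between successive approximations
    at the previous and the current iteration.
    """
    if d_curr < tol:
        return 0                      # stopping criterion already satisfied
    reduction = d_prev / d_curr       # how many times the distance shrank
    if reduction <= 1.0:
        return fallback               # no observed convergence: take a small step
    k = math.log(d_curr / tol) / math.log(reduction)
    return max(1, math.ceil(k))


if __name__ == "__main__":
    # Distance halves per iteration; 0.5 must drop below 0.01.
    print(evaluate_n(1.0, 0.5, 0.01))
```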


3.2 Simulation Results 


Both the Jacobi algorithm and our modified Jacobi implementation
were evaluated for various problem sizes and
machine configurations. Simulations were performed for
problem sizes 3 × 3 to 32 × 32. The results, for the 8 × 8
and 32 × 32 systems, in terms of simulation time and
speedup, are shown in Table I. The numbers shown under
the “Repeat” column correspond to the traditional implementation
of the Jacobi method and the ones under the
“Forall” heading correspond to the modified Jacobi. Actually,
in order to assess the effect of the Forall insertion only,
we further modified our modified Jacobi implementation.
The inserted FOR loop (FORALL) is performed only once.
The total number of iterations was known beforehand and
“forced” into the program graph. The stopping criterion in
the traditional implementation of the algorithm was chosen
in such a way that exactly 10 iterations were necessary.
This was done for all problem sizes. In other words,
all the results shown in Table I correspond to an execution
of 10 iterations. Also, the stopping criterion (check if
$\|x^{(n)} - x^{(n-1)}\| < tol$) is inside the For loop, unlike the modified
Jacobi implementation of Figure 1a. Therefore both
implementations (Repeat and Forall) are identical; the only
difference is that the FORALL construct will initiate all
the iterations from the beginning. This indeed isolates and
accentuates the effect of the FORALL insertion.

Table I: Simulation results for the traditional implementation
of Jacobi (Repeat) and the modified implementation (Forall).

Figure 2: Speedup vs # of PEs for the 8x8 and 32x32
problem for both Jacobi and Modified Jacobi Implementations.

The speedup vs. # PE’s curve, for the 8 x 8 and 32 x 32 
problems is shown in Figure 2. This plot shows clearly that 
the implementation with the FORALL outperforms the 
traditional implementation throughout the whole space. 

However, although the FORALL implementation outperforms
the traditional implementation over the whole
space of experiments and even projects higher margins
of improvement, it is still not as efficient as expected. This
modified scheme targets the “overhead/synchronization”
actors introduced by the U-interpreter. Since these constitute
about 70% of the total graph, the improvement
should have been higher. Among all possible factors, we
examine the effect of the fact that the critical path is getting
no special treatment. The critical path in this context
is the data dependencies among successive iterations, i.e.,
computation actors.


4 Mechanism for Priority Handling 


By completely unraveling more than one iteration at a
time, more parallelism is exploited because we have the
overhead/synchronization actors of all n iterations initiated
from the beginning. However, the actors belonging to
the actual computation (for the rest of this paper they will
be referred to as computation actors while the rest will be
referred to as synchronization actors) get no special treatment.
Therefore, at any given time t they have to compete
with the synchronization actors for machine resources. The
probability that a computation actor belonging to iteration i
at time t is allocated a specific resource r is given by:

$$P(i,r,t) = \frac{C_{i,r}(t)}{\sum_{j=1}^{n} \big( S_{j,r}(t) + C_{j,r}(t) \big)} \qquad (6)$$

where $S_{j,r}(t)$ is the number of synchronization actors of iteration
j at time t competing for resource r, $C_{j,r}(t)$
is the number of computation actors, also belonging
to iteration j at time t, competing for resource r, and
n is the number of active iterations. In other words,
the numerator of the r.h.s. of equation 6 is the number of
computation actors of iteration i waiting for resource r, and the
denominator is the total number of actors waiting for resource r.
Therefore the expected wait time $E_r(i,t)$ for any computation
actor belonging to iteration i to get hold of resource
r at any time t is

$$E_r(i,t) = \big( P(i,r,t) \big)^{-1} = \frac{\sum_{j=1}^{n} \big( S_{j,r}(t) + C_{j,r}(t) \big)}{C_{i,r}(t)} \qquad (7)$$

Calculating the expected duration of each iteration analytically
is not a trivial matter; as demonstrated by equations
6 and 7, it is very complex to estimate how long it will
take to gain access to a single resource. However, it is clear
that if there are many more synchronization actors than
computation actors (per iteration) the expected wait time
for a resource will be high.
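
To see how quickly the wait grows, the two quantities can be evaluated for made-up actor counts. The sketch below reads equation 6 as the share of all waiting actors that are computation actors of iteration i, and equation 7 as its reciprocal; the function names and the sample counts are assumptions for illustration, not simulation data.

```python
def allocation_probability(i, sync_counts, comp_counts):
    """Share of the actors waiting for a resource that are computation
    actors of iteration i. sync_counts[j] and comp_counts[j] hold the
    numbers of waiting synchronization and computation actors of iteration j."""
    total_waiting = sum(s + c for s, c in zip(sync_counts, comp_counts))
    if total_waiting == 0:
        return 0.0
    return comp_counts[i] / total_waiting


def expected_wait(i, sync_counts, comp_counts):
    """Reciprocal of the allocation probability (infinite if no such actor waits)."""
    if comp_counts[i] == 0:
        return float("inf")
    total_waiting = sum(s + c for s, c in zip(sync_counts, comp_counts))
    return total_waiting / comp_counts[i]


if __name__ == "__main__":
    # Two active iterations, hypothetical counts of waiting actors.
    print(expected_wait(0, [3, 3], [2, 2]))
    # More synchronization actors, same computation actors: longer wait.
    print(expected_wait(0, [8, 8], [2, 2]))
```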


4.1 The priority algorithm


The modified Jacobi implementation is motivated by the
fact that otherwise idle processors can be kept busy by
dealing with future iteration actors. However, as suggested
in the previous section, the actors of the future iterations
are actually competing with the actors of the current iteration.
It has been shown in [6] that this indeed dramatically extends
the lifetime of the early iterations. Therefore
some sort of priority hierarchy is needed to ensure that the
early iterations are not delayed. A good candidate to be
used as a priority field is the tag associated with each token.
The u.c.s.i tag can be mapped by a one-to-one function
f: tag → N. This means that by sorting the tags of the
tokens in the firing queue (or any other queue), we can
guarantee that the ordering imposed by the programmer is
observed.

If no such strict priority is required, then preference
for tokens expected to be needed first can be enabled by
using the iteration part i of the tag u.c.s.i of the outermost
level. Whether a loop is incrementing (For i = i_0, n
where n > i_0) or decrementing (For i = n, i_0 where n > i_0), the
iteration part i of the tag is always incrementing. Therefore,
if tokens with lower iteration values have priority over
other tokens with higher iteration values, the competition
for resources remains among actors belonging to the same
iteration. In short, the priority mechanism is:


Tokens with lower tag iteration identifier i at
the outermost level of their tag have priority over
other tokens.


This policy works well with various kinds of loop con- 
structs. However, more complex analyses and policies may 
be required for more complex graphs. 
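
The policy can be sketched as a priority queue keyed on the outermost iteration number of each token's tag. This is a minimal software illustration, assuming tokens are reduced to an iteration number plus a payload string; it says nothing about how a firing queue would be built in hardware. Tokens of the same iteration are served in arrival order.

```python
import heapq
from itertools import count


class TokenQueue:
    """Firing queue that always serves the lowest iteration number first."""

    def __init__(self):
        self._heap = []
        self._arrival = count()   # tie-breaker: arrival order within an iteration

    def push(self, iteration, payload):
        heapq.heappush(self._heap, (iteration, next(self._arrival), payload))

    def pop(self):
        iteration, _, payload = heapq.heappop(self._heap)
        return iteration, payload

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = TokenQueue()
    queue.push(3, "sync-3")
    queue.push(1, "comp-1")
    queue.push(2, "sync-2")
    queue.push(1, "sync-1")
    while queue:
        print(queue.pop())
```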


4.2 Simulation Results 


Having successfully tested the influence of both the inserted 
FORALL and the priority mechanism [6], we proceed by 
testing the entire modified Jacobi implementation with the 
priority mechanism enforced. The priority was enforced at 
the outer FORALL loop (the inserted loop). The same 
problem sizes, as in the previous Section, were investigated.
Simulation results for the 8 × 8 and 32 × 32 systems
are shown in Tables II and III. The results under column
“FOR+PRI” correspond to our modified algorithm implementation
with the priority policy enforced. The manner
in which the speedup is calculated for this set of results dif- 
fers from the definition of the speedup used for the previous 
set of results. Rather than comparing the performance of 
the algorithm using multiple PEs with that of the same al- 
gorithm using a single PE, we compare the performance of 
the algorithm using multiple PEs with the performance of 
the standard unmodified algorithm using a single PE. This 
was done because the object under evaluation was not the 
architectural model but the modified Jacobi implementa- 
tion. 

Figure 3 shows the plots for 8x8 and 32 x 32 for both the 
traditional implementation of Jacobi and the modified im- 
plementation with the priority mechanism enforced. The 
modified implementation (FOR+PRI) outperforms the tra- 
ditional implementation throughout the whole space. The 
results for the rest of the problem sizes were similar to the 
ones presented. The speedup enhancement of the modified 
over the traditional implementation is in the range of 3. 

Overall, the simulation results show that the proposed
modification to the algorithm and the priority mechanism
provide very good enhancement to the performance of
iterative algorithms. For “real life” applications with problem
sizes in the range of 10* — 10* the enhancement in
speedup should be much higher. The interested reader can
refer to [6] for a more complete and detailed presentation
and analysis of the results.

Table II: Simulation time and speedup (S) for the 8x8 matrix
and n=40,20,10,5. Columns: REPEAT, FOR+PRI.

Table III: Simulation time and speedup for the 32 x 32
matrix and n=15,5,3.


Figure 3: Speedup vs # of PEs for the 8x8 and 32x32
problem for both Jacobi and Modified Jacobi Implementations
with the priority mechanism enforced.


5 Conclusions


In this paper, we have identified an important source of
inefficient operations in data-driven machines. We have
presented a program graph optimization scheme which can
be applied to iterative algorithms in dynamic data-flow
environments. In our scheme, we rewrite the conventional
WHILE (also referred to as REPEAT) operator and
replace it by a WHILE/FORALL construct. This allows
a more efficient “block” execution which bypasses much of
the overhead traditionally associated with data-flow computations
by enabling a certain amount of “look-ahead” on
the termination criterion. We have verified our scheme by
a combination of analytical and deterministic simulation
means. We have shown that the speedup of the modified
implementation is considerably enhanced over the traditional
implementation. This is brought about by the larger
number of instructions which can be executed concurrently
in an asynchronous fashion.

However, it has been discovered that this very “anar- 
chy” in the scheduling of operations would underutilize 
the resources by favoring “low-yield” operations (i.e., those 
overhead instructions which spawn few other actors in the 
“computation” part of the program while conversely en- 
abling more “overhead/synchronization” actors: a case of 
bureaucratic folly!). In order to make sure that the compu- 
tation work would be performed immediately, possibly at 
the expense of overhead work in higher iterations, we have 
hence developed a hierarchy mechanism which tends to give 
higher priority to the execution of instructions in lower it- 
erations. Both analytical and simulation results show that 
the priority mechanism reduces the lifetime of the individ- 
ual iterations of the modified algorithm. This yields con- 
siderably better resource utilization and faster execution. 
Higher performance is achieved as a direct combined effect
of the modified algorithm and the priority mechanism.
Abstract


In this paper, we propose two graceful degra- 
dation schemes for two-dimensional wavefront ar- 
rays. The first scheme is the static dataflow scheme 
which can be applied to both run-time and compile- 
time applications. The second scheme is the dy- 
namic dataflow scheme which is mainly used for 
run-time applications. For the static scheme, topics
discussed include program complexity and load
balancing. For the dynamic scheme, the focus is on 
the routing control. Without broadcasting, a dis- 
tributed routing algorithm which is self-adaptive to 
different faulty patterns is developed. An upper 
bound analysis of the system survival probability 
for both schemes is presented. Results of Monte- 
Carlo simulations for both schemes are compared 
and the tradeoff between fault-tolerant capability 
and hardware complexity is explored.


1 Introduction 


VLSI/WSI processor arrays have regular and modular 
structures that match the computational requirements of 
most signal and image processing algorithms. Their par- 
allel/pipelined processing characteristics will satisfy the 
very high computational throughputs in real-time appli- 
cations. However, it is almost impossible to guarantee 
that an array with a large number of processing elements 
(PEs) will have all the PEs running correctly in a mis- 
sion time. Therefore, fault-tolerant techniques must be 
incorporated into these systems. A desired objective of a
fault-tolerant design is to maximize reliability while min- 
imizing the corresponding hardware and time overhead. 

Fabrication defects and operational faults on wafers 
are inevitable in today’s IC technology. The motivation 
for incorporating fault tolerance in VLSI/WSI processor 
arrays is two-fold: yield enhancement in fabrication 
time and reliability improvement in run-time. As to
yield enhancement, the fabrication defect problem can be
solved by using static restructuring techniques [9] to connect
the good components.
As to reliability improvement, operational faults have
a much lower probability of occurrence as compared to 
production defects. Reconfiguration and graceful degra- 
dation have been used to deal with operational faults. A 
host-driven fault-stealing reconfiguration method is pro- 
posed by Sami and Stefanelli [11] to replace faulty PEs 
with good spare PEs. On the other hand, a distributed re- 
configuration algorithm is proposed in [5]. In these meth- 
ods, spare PEs are used and thus the size of the physical 
array is greater than the size of the logical array. When 
the logical array size is equal to the physical array size, 
graceful degradation techniques, which use time redundancy
instead of space redundancy, should be applied. The
row/column elimination method by Fortes and Raghaven- 
dra [2] is a typical example. Their design uses switches to 
bypass the whole row (or column) of a faulty PE and thus 
reduce the size of the logical array. Since the size of the 
logical array is reduced, the algorithm needs to be recom- 
piled and the host computer is involved. A lot of time may 
be consumed in the recompilation and the propagation of 
data/control signals between the host computer and the 
array. This paper proposes graceful degradation methods
which require no recompilation of algorithms. 

In VLSI array processing, it is critical to avoid large 
clock skews in synchronizing systolic computing network. 
A simple solution is to take advantage of the dataflow 
computing principle such as in wavefront array processing 
[6]. Conceptually, a wavefront array equals a systolic array 
plus static dataflow computing. Thus the requirement for 
correct timing in the systolic array is now replaced by a 
requirement for correct sequencing in the wavefront array. 
Graceful degradation schemes are proposed in this paper 
for two-dimensional wavefront arrays. 

The paper is organized as follows: In Section 2, the 
array topology and fault assumptions are described. In 
Section 3, an upper bound analysis of the system survival 
probability is presented. Distributed adaptive routing al- 
gorithms with static and dynamic dataflow are described 
in Sections 4 and 5, respectively. Finally, we summarize 
some comparisons in Section 6. 


2 Two Dimensional Grid Network 
Model 
Topology Most signal and image processing algorithms
are dominated by filtering, transfer techniques, and some 
key linear algebraic methods. These algorithms, possess- 
ing common properties such as regularity, recursiveness 
and locality, can be efficiently computed in processor ar- 
rays with mesh-type interconnections. For example, Fig- 
ure 1(a) shows some arrays with mesh-type interconnec- 
tions. Different algorithms may require different kinds 
of interconnections. An array which exactly matches the
requirement of an algorithm is called a logical array. Ap- 
parently, the array that people build, the physical array, 
can not be the same as all the different logical arrays. 
In this paper, a torus topology (also called a k-ary 2-cube
interconnection network [1]), as shown in Figure 1(b), is 
provided for the physical array. Note that in a torus array, 
the failure of boundary PEs can be treated the same as 
that of internal PEs. From the implementation point of 
view, the wraparound interconnections of a torus are fea- 
sible on PC boards and wafers. Since adjacent PEs on the 
array are interconnected directly, no external switches [4] 
are required among PEs. Thus some VLSI area and hard- 
ware design work can be saved. Furthermore the model 
requires no global wires like that used in the host-driven 
global reconfiguration technique [10] or that used in the 
row/column elimination method [2]. In our model, bidi- 
rectional information flows are allowed in the logical ar- 
ray. Compared to unidirectional information flow as in 
[11], this model will have broader applications. 
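
The wraparound property can be illustrated with a short Python sketch of neighbor addressing on an N x M torus. The coordinate convention (row, column) and the compass names are assumptions made for the sketch. Because indices wrap modulo the array dimensions, a boundary PE has the same four neighbors as an internal PE.

```python
def torus_neighbors(row, col, n_rows, n_cols):
    """Return the four directly connected neighbors of PE (row, col)."""
    return {
        "north": ((row - 1) % n_rows, col),
        "south": ((row + 1) % n_rows, col),
        "west": (row, (col - 1) % n_cols),
        "east": (row, (col + 1) % n_cols),
    }


if __name__ == "__main__":
    # Corner PE of a 4 x 4 torus: wraparound links supply the missing neighbors.
    print(torus_neighbors(0, 0, 4, 4))
```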
In the proposed grid network, 


1. Each PE is self-tested (including computation and
routing parts) and its test results are transmitted to
adjacent PEs. Note that these fault signaling wires
are not shown in Figure 1(b).


2. Since a faulty PE may contaminate data which pass
through it, a PE will disconnect the communication
links between itself and a neighboring faulty PE. This
will force data to flow only through good PEs and links.


3. When one part of the array causes the system to fail, a
system failure signal should be generated so that the
host and the other part of the array can be notified.
This system failure signal is very critical and a bad
method would result in a very unreliable system. A
double checking procedure should be used to make
sure the signal is not from a faulty PE. The adoption
of global links is not a good way to solve this
problem from this point of view. Thus the system
failure signal is also propagated locally with O(N)
time penalty for an array with size N x N. PEs
which receive a system failure signal are responsible
for executing the double checking procedure.


Fault Assumptions In this paper, we make the follow- 
ing fault assumptions: 


1. The self-testing part is fault free. Some kinds of
hardware redundancy, e.g., TMR, may be incorporated
to ensure this assumption.


2. The communication links are fault free. Some
error-correcting-code techniques may be incorporated so
that this assumption can be made.


3. Fault signaling wires to neighbor PEs are fault free.
Because the information transmitted over each wire
is only one bit, a robust design can be developed
without causing too much hardware overhead.


4. System failure signaling is fault free.


Figure 1: Two dimensional processor array grid network
model. (a) Logical arrays, (b) Physical array (torus).


Algorithm Matching Many algorithms, when executed 
on a processor array, require more communication links 
than what a torus array can support. For example, a 
southeastern link may be required for some algorithms. 
The problem of mapping a logical array to a physical ar- 
ray is called a matching problem. In order to implement 
such algorithms on a torus array, data which can not be 
transmitted over direct links should route through several 
links to reach their destinations. 


1. If synchronous design is to be adopted, then the 
communication links have to be time-shared in or- 
der to accommodate all the data links required by 
algorithms. A time-shared scheme for the communi- 
cation links was proposed in [7]. Their idea is to use 
buffers of different size for different data so that the 
data usages of the link can be time-multiplexed. Be- 
ing able to determine the buffer size, they are able to 
use synchronous arrays to handle those algorithms. 


2. If asynchronous design is to be adopted, then either
static or dynamic dataflow schemes can be used.


• A data token may route through several communication
links to reach its destination. But
the sequence of transmitting multiple data tokens
within each PE is predetermined and can
thus be incorporated into the program within
each PE. This is a static dataflow approach.


• All data tokens can be tagged with information
about their destination PEs. In this case, data
tokens reach their destination via some routing
algorithms. The routing path is not predetermined.
This is a dynamic dataflow approach.


In this paper, only asynchronous designs are considered.


Two Steps in Graceful Degradation For a gracefully
degradable dataflow array, two steps are involved.


1. Once a PE is faulty, its task should be reassigned to
a good PE. This is called job reassignment.


2. Once job reassignment is fixed, data routing is needed
to implement the logical array on the faulty physical
array. Two routing schemes are discussed in
this paper, namely, static and dynamic dataflow approaches.


3 Job Reassignment and Upper
Bound Analysis 


Job Reassignment Job reassignment strategies directly 
affect the control complexity of PEs and the system fault- 
tolerance capability. In general, the more flexible the job 
reassignment is, the more complicated the PE hardware 
will be. However, the increase of reassignment flexibility 
also increases the fault-tolerance capability. To simplify 
the discussion, the job handled by each PE is assumed 
to be non-breakable. Since the slowest PE in an array 
determines the array throughput, jobs of different faulty 
PEs should be reassigned to different good PEs. Here we 
focus on the design with fixed reassignment, i.e., the job
of a faulty PE is always assigned to its left neighbor PE.
Thus a good PE handles at most two PE jobs.


Definition: The system survival probability, Ps, is the 
probability that the system (array) works with tolerable 
degradation in performance given that the system works 
initially. 

If r is the PE reliability, i.e., the probability that a 
PE is good (given that the PE is good initially), then for 
an array with size N x M, 


$$P_s = \sum_{i=0}^{N \times M} r^{(N \times M) - i} (1 - r)^{i} D_i \qquad (1)$$

where $D_i$ is the number of successful faulty patterns with
$i$ faulty PEs.


Upper Bound Analysis of Ps In our scheme, the task
of a faulty PE is always reassigned to its left neighbor
PE. Thus no two adjacent PEs in the same row are
allowed to be faulty at the same time. Here we want to
calculate the probability that a system fails due to the
existence of two adjacent faulty PEs in the same row. First, a
linear array is considered. The result is then extended
to a ring array. Finally, results for two-dimensional torus
arrays are obtained.

Figure 2: Upper bound for system survival probability vs.
array size.


1. Assume there are M PEs in a linear array and q of 
them are faulty; denote the number of successful 
reassignments as f(M, q). Then f(M, q) can be 
computed from the following recurrence equations. 


M>2, |¥|>q>1, 
f(M,0) = 1 M>1,
f(M,1) = M M>2,
f(M,q) = 0 otherwise.

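2. If the array structure is a ring, the number of successful
reassignments, g(M, q), can be computed by noting the
following relation between f(·,·) and g(·,·): either a given
PE is good, leaving a line of M − 1 PEs with q faults, or it
is faulty, so that both of its neighbors must be good and a
line of M − 3 PEs holds the remaining q − 1 faults.

g(M, q) = f(M − 1, q) + f(M − 3, q − 1)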
3. Assume that the array size is N rows and M columns
in our torus model. If there are q faulty PEs, denote the
number of successful reassignments as h(N, M, q); then

$$h(N, M, q) = \sum_{k=0}^{q} h(N-1, M, q-k)\, g(M, k), \qquad N \ge 2, \qquad (2)$$

with $h(1, M, q) = g(M, q)$.


Once we obtain h(N, M, q), we may compute Ps:

$$P_s = \sum_{q=0}^{N \times M} h(N, M, q)\, r^{(N \times M) - q} (1 - r)^{q} \qquad (3)$$


Let’s assume the 2-D array is square (i.e., N = M) and
compute Ps. The result is shown in Figure 2.


The actual performance of the scheme depends on the
success of both job reassignment and token routing.
The above analysis counts the faulty patterns for which job
reassignment succeeds, without any routing consideration.
Hence the results in Figure 2 are an upper bound on Ps.
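The counting argument above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' program: the general case of the recurrence for f is not legible in the source, so f is taken here to be the standard count of q pairwise non-adjacent positions in a line of M, which agrees with the stated cases f(M,0) = 1 and f(M,1) = M; the ring relation is used with q − 1 faults in the second term; the function names are illustrative.

```python
from functools import lru_cache
from math import comb


def f(M: int, q: int) -> int:
    """Ways to place q faulty PEs in a line of M PEs with no two adjacent.

    Assumption: closed form C(M - q + 1, q), standing in for the
    recurrence whose general case is not legible in the source.
    """
    if M < 0 or q < 0 or q > M:
        return 0
    return comb(M - q + 1, q)


def g(M: int, q: int) -> int:
    """Same count for a ring of M >= 3 PEs (one row of the torus)."""
    return f(M - 1, q) + f(M - 3, q - 1)


@lru_cache(maxsize=None)
def h(N: int, M: int, q: int) -> int:
    """Eq. (2): N independent ring rows holding q faults in total."""
    if N == 1:
        return g(M, q)
    return sum(h(N - 1, M, q - k) * g(M, k) for k in range(q + 1))


def survival_upper_bound(N: int, M: int, r: float) -> float:
    """Eq. (3): upper bound on Ps for PE reliability r."""
    total = N * M
    return sum(h(N, M, q) * r ** (total - q) * (1 - r) ** q
               for q in range(total + 1))
```

For a square array, `survival_upper_bound(N, N, r)` gives the quantity plotted against array size in Figure 2, under the assumptions listed above.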


Figure 3: The communication patterns: (a) before the
fault occurrence; (b) after the fault occurrence and task
reassignment.


4 Static Dataflow Scheme 


For a wavefront array, the dataflow sequences within each 
PE are predetermined and can be handled by the program 
within each PE. Here we propose a static dataflow graceful 
degradation scheme. 


Definition: An N(a,b) neighbor region of a PE is an a × b
region with the PE as the center of the region. Here a and
b are odd integers.


First let’s consider the case with a single faulty PE.

The tasks originally handled by a faulty PE can be shared
by the PEs in its N(3,3) neighbor region. The load sharing
scheme can be predetermined and thus can be handled
by programs. Note that PEs located outside the N(3,3)
neighbor region need not change their dataflow sequences.

The more complicated the faulty patterns we try to handle,
the more complicated the program would be. To simplify
the programming, we propose to handle only cases
where no two N(3,3) neighbor regions of faulty PEs
overlap.


Program Complexity A comparison of the communication
patterns shown in Figure 3 illustrates the complexity
of the program to be written into each PE. Once a
fault occurs, the PEs in the N(3,3) neighbor region should be
notified. Thus eight error signaling wires should be used
for each PE. Furthermore, when a fault occurs, the routing
functions of the PEs in the N(3,3) neighbor region
differ. In Figure 3(b), there are seven PEs with different
faulty routing functions and one PE (the one in the
upper-right corner) with the normal routing function. To
adapt to different faulty patterns, each PE must be able
to execute eight different routing functions, i.e., the
normal function and seven emergency functions. These eight
functions are predetermined and can be implemented either
by hardware (some kind of router) or by software
programming.¹

Note that arrays with different number of faulty PEs 
possess almost the same throughput since their slowest 
PEs have the same number of PE jobs. If a PE job can 
be shared by several PEs, then load balancing should be 
taken into account. Since load balancing can be predeter- 
mined and handled by programs, the degradation of ar- 
ray throughput can be minimized without increasing the 
hardware complexity. 


Monte-Carlo Simulation In this scheme, Ps is the
probability that, in the array, no two N(3,3) neighbor
regions of faulty PEs overlap. Thus a faulty pattern is
successful if and only if no two N(3,3) neighbor regions
of faulty PEs overlap. Equivalently, a necessary and
sufficient condition for a faulty pattern to be successful
is that every faulty PE is located outside the N(5,5)
neighbor regions of all other faulty PEs.

To estimate Ps, a Monte-Carlo simulation was performed
for different array sizes (N) and different PE reliabilities
(r). In the simulation, to estimate D_i (cf. Eq. 1),
100,000 random faulty patterns, each with i faulty PEs,
are used and the number of successful faulty patterns is
counted. In this way, we estimate D_i and then use Eq. 1
to compute Ps. The results for different PE reliabilities, r,
are shown in Figure 4. Note that for a system without
fault-tolerance capability, the system reliability is r^(N×N),
which is also shown in Figure 4.
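The estimation procedure just described can be sketched as follows. This is an illustrative sketch only: the wrap-around (torus) distance, the default trial count (far below the 100,000 patterns used in the paper, to keep the example fast), the early cut-off, and all function names are assumptions, not details taken from the simulation reported here.

```python
import random
from math import comb


def regions_overlap(a, b, n):
    """True if the N(3,3) regions of PEs a and b overlap on an n x n torus,
    i.e. b lies inside the N(5,5) region of a (wrap-around is assumed)."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, n - dr) <= 2 and min(dc, n - dc) <= 2


def success_fraction(n, i, trials, rng):
    """Fraction of random patterns with i faulty PEs that are successful."""
    cells = [(row, col) for row in range(n) for col in range(n)]
    ok = 0
    for _ in range(trials):
        faulty = rng.sample(cells, i)
        if not any(regions_overlap(faulty[p], faulty[s], n)
                   for p in range(i) for s in range(p + 1, i)):
            ok += 1
    return ok / trials


def estimate_ps(n, r, trials=1000, seed=0):
    """Estimate Ps through Eq. (1), with D_i ~ C(n*n, i) * success fraction."""
    rng = random.Random(seed)
    total = n * n
    ps = 0.0
    for i in range(total + 1):
        frac = success_fraction(n, i, trials, rng)
        if frac == 0.0 and i > 1:
            break  # heuristic cut-off: denser patterns are treated as failures
        ps += comb(total, i) * frac * r ** (total - i) * (1 - r) ** i
    return ps
```

On a 5 x 5 torus every pair of PEs lies within each other's N(5,5) region, so under these assumptions only patterns with at most one faulty PE survive, which makes a convenient sanity check.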


Compile-Time Environment As we can see, Ps for
this scheme is not very attractive. The problem is the
requirement of non-overlapping N(3,3) neighbor regions,
which is imposed to reduce the complexity of the program
residing in each PE. If a higher Ps is required in a
run-time environment, the dynamic dataflow scheme, as
explained later, may be used. However, if compile-time
fault tolerance is considered, then N(3,3) neighbor regions
need not be non-overlapping and Ps can be improved
drastically. This is explained below.

Compile-time fault tolerance is for arrays which are
designed as “general purpose” machines with array
compilers to compile array programs into PE programs. Users
write programs at the array level and produce array
programs. To execute an array program, the array is first
examined to see if there are any faulty PEs. Then, adapting
to the faulty pattern found, the array compiler produces
PE programs. Finally, the PE programs are loaded into the
array and executed.

In this case, the error detection time can be longer
(compared to run-time environments), more complete fault
location operations can be enforced, and communication
link errors may be detected (i.e., the fault assumptions
can be reduced). The PE programs loaded into PEs may
be different, but each PE needs only one PE program
regardless of the faulty pattern. In this case, the faulty
pattern can be more complicated without greatly increasing
the complexity of each PE’s program. Thus the system
can handle some faulty patterns with overlapping N(3,3)
neighbor regions of faulty PEs and drastically increase Ps.

¹ Some checkpoints or resynchronization are required to restart PEs
in the same neighbor region.

Figure 4: System survival probability with/without static
graceful degradation.

Figure 5: Protocol of data token.


5 Dynamic Dataflow Scheme 


An array with dynamic dataflow is an array whose data 
are tagged with some header. By extending the header, 
graceful degradation may be achieved. 


A data token definition (see Figure 5) for fault-tolerance
can be stated as follows:

1. X/Y field: used to denote the relative position of
the current PE (in which the token resides) to the
destination PE before reassignment.² If the current
PE and the destination PE locations are (i, j) and
(i_p, j_p) respectively, then x/y, the value of the X/Y
field, is defined as x = i − i_p and y = j − j_p.

2. IN-PORT field: used to denote the original input
port of the destination PE. In a torus array, two
bits are required to distinguish the four input ports
(see Figure 1(a)). If the links in the array are
unidirectional, one bit is sufficient to distinguish the two
input ports.

3. BOUNCE field: used to indicate the number of times
the token is kept away from the destination PE
(because of the existence of faulty PEs).

² Since all fault signaling wires are for local communication, the
source PE may not know of the failure of the destination PE. Thus the
location of the destination PE before reassignment must be used.
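As an illustration of the token header described above, the following minimal sketch builds a token from the current and destination PE coordinates. The class name, the field types, and the `data` payload type are hypothetical; torus wrap-around and the bit widths of the fields are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Token:
    x: int        # X field: i minus the destination's i coordinate
    y: int        # Y field: j minus the destination's j coordinate
    in_port: int  # original input port of the destination PE
    bounce: int   # number of times the token has been bounced so far
    data: float   # payload (type chosen only for illustration)


def make_token(current, destination, in_port, data):
    """Create a token at PE `current` addressed to PE `destination`.

    Both arguments are (i, j) pairs; `destination` is the location
    of the destination PE before any job reassignment.
    """
    i, j = current
    dest_i, dest_j = destination
    return Token(x=i - dest_i, y=j - dest_j,
                 in_port=in_port, bounce=0, data=data)
```

For example, `make_token((2, 3), (3, 3), in_port=1, data=0.5)` yields an X/Y value of −1/0.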


Routing Algorithm The basic idea is to locally up- 
date the X/Y field so that the destination PE is gradually 
approached. A 0/0 value in the X/Y field means the des- 
tination PE is reached. Then the IN-PORT field can be 
used to distinguish the source PE. If the destination PE 
is faulty, then the data token will reach the reassigned PE 
with its X/Y field as —1/0. Note that the X/Y field can 
be used for the reassigned PE to distinguish whether the 
token is for itself or for its faulty right neighbor PE. 

When a faulty PE is encountered during a routing, a 
data token might be kept away from its destination PE. 
The BOUNCE field is used to constrain the number of 
“bounces”, i.e., the action that forces a data token to leave 
its destination PE. For example, the number of bounces 
cannot be more than 3 if 2 bits are used for the BOUNCE 
field. When the number of bounces for a data token is over 
3, the system is declared “failed”. Apparently, the larger 
the number of bits for the BOUNCE field, the higher the 
number of faulty PEs the system can tolerate and thus 
the higher Ps. The purpose of the BOUNCE field is to 
avoid infinite loops which will be explained later. 

In the routing algorithm, no backtracking is allowed.
That is, once it leaves a PE, a data token is not allowed
to return to that PE immediately. This avoids some
useless routing steps.
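Two pieces of the routing rules above lend themselves to a short sketch: interpreting the X/Y value on arrival, and enforcing the bounce limit implied by the width of the BOUNCE field. The function and exception names are hypothetical, and the actual hop-by-hop update of the X/Y field is not specified in enough detail here to be reproduced.

```python
class SystemFailed(Exception):
    """Raised when a token exceeds its bounce budget."""


def classify_arrival(x: int, y: int) -> str:
    """Interpret the X/Y field of a token at the current PE."""
    if (x, y) == (0, 0):
        return "destination"   # token is for this PE
    if (x, y) == (-1, 0):
        return "reassigned"    # token may be for the faulty right neighbor
    return "in transit"


def max_bounces(bits: int) -> int:
    """Largest count that a BOUNCE field of `bits` bits can hold."""
    return (1 << bits) - 1


def record_bounce(bounce: int, bits: int) -> int:
    """Return the updated BOUNCE value, or declare the system failed."""
    if bounce + 1 > max_bounces(bits):
        raise SystemFailed(f"more than {max_bounces(bits)} bounces")
    return bounce + 1
```

With a 2-bit BOUNCE field the limit is 3 bounces, matching the example in the text; a fourth bounce raises `SystemFailed`.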


Loop Free Requirement Because of the lack of global
information, most distributed routing algorithms need to
solve the problem of infinite looping [8,3]. That is, tokens
may be trapped in some loops forever. In a fault-tolerant
array, a data token may be trapped in an infinite loop for
some peculiar faulty patterns.

To solve the infinite loop problem, a straightforward
way is to use an AGE field in the data token to indicate
the number of links the data token has traveled through.
If an upper bound is set on the AGE field of a token, it
is impossible to have an infinite loop. The problem with
this scheme is that too many bits would be used for the
AGE field. By using the BOUNCE field, the proposed
algorithm can avoid infinite loops with fewer bits.


Figure 6: System survival probability for dynamic
dataflow scheme with W=3 and B=3.


Monte-Carlo Simulation Let W be the number of
bits in the X/Y field and B be the number of bits in
the BOUNCE field. To take routing into account, a
Monte-Carlo simulation was performed where, for each specific
case (with a specific array size and a specific number of
faulty PEs), 100,000 random faulty patterns were used. In the
simulation, the unidirectional mesh array communication
is assumed to be the logical array. The simulation results
for W=3 and B=3 are summarized in Figure 6.


Tradeoff Since W constrains the range over which tokens
can flow and B sets an upper bound on the number of
faulty PEs which can be tolerated, increasing W and B
will improve the routing capability and Ps. However, the
hardware cost is increased accordingly. Monte-Carlo
simulations were made to illustrate the system survival
probabilities and the results for four cases (W=2, B=2; W=2,
B=3; W=3, B=2; and W=3, B=3) are summarized in
Figure 7.

Note that Ps for the case W=3 and B=3 is very
close to the upper bound. This means that very few routings
fail and thus no more bits need be used for W and B.


Communication Overhead To illustrate the communication
overhead of the routing scheme, we define the
maximum routing distance for a successful faulty pattern
as the length of the longest routing path. The average
maximum routing distance of an array is the average of the
maximum routing distances over all the successful faulty
patterns. This parameter indicates the communication
overhead involved and can be expressed as

$$\frac{1}{P_s} \sum_{i=0}^{N \times M} L_i D_i\, r^{(N \times M) - i} (1 - r)^{i}$$

where L_i is the average maximum routing distance for
successful faulty patterns with i faulty PEs and can be
estimated by simulations. Table 1 shows the simulation 


results for the average maximum routing distance.

Figure 7: System survival probabilities for cases with
different W and B (curves: upper bound; W=3, B=3; W=3, B=2;
W=2, B=3; W=2, B=2).




Table 1: The average maximum routing distances. 


6 Conclusion 


Two graceful degradation schemes are proposed in this paper
for wavefront arrays in run-time applications. A graceful
degradation scheme without tagging data is proposed
for the static dataflow array. For the dynamic dataflow
array, a scheme which extends the header of the data token is
described. An upper bound analysis of the system survival
probability for both schemes is presented. Simulations
were made to estimate the system survival probabilities
of both schemes. It is found that the dynamic dataflow
scheme exhibits higher system survival probabilities (see
Figure 8) which are very close to the upper bounds. This
is at the expense of higher hardware complexity. In the
compile-time environment, the system survival probability
can be improved for the static dataflow scheme by relaxing
the constraint that no two N(3,3) neighbor regions
of faulty PEs overlap. Although the array compiler will be
somewhat more involved, the hardware complexity remains
the same.

Abstract


Much effort has been expended on developing special
architectures dedicated to the efficient execution of
production systems. While data-flow principles of execution offer
the promise of high programmability for numerical
computations, we demonstrate here that the data driven
principles can also be applied to symbolic computations. In
particular, we consider a mapping of the Rete match
algorithm on the MIT Tagged Token Data-flow Architecture.
The results of a deterministic simulation of this
multiprocessor architecture demonstrate that artificial intelligence
production systems can be efficiently mapped on
data-driven architectures.


1. Introduction 

In rule-based production systems, it is often the case that
the set of rules and facts needed to represent a particular
production system in a certain problem domain is
very large. It is thus known that simply applying software
techniques to the matching process would yield intolerable
delays. Indeed, as [Fo82] has pointed out, the time
taken to match patterns over a set of rules can reach 90%
of the total computation time spent in expert systems.
The need for faster execution of production systems has
spurred research in both the software and hardware
domains. The conventional control flow model of execution
is limited by the “von Neumann bottleneck” [Ba78].
Architectures based on this model cannot easily deliver large
amounts of parallelism [A183]. The data driven model of
execution has therefore been proposed as a solution to
these problems. These principles have been surveyed by
[Ga87]. The purpose of this paper is to demonstrate the
applicability of data-flow principles of execution and of
architecture design to the solution of artificial intelligence
(AI) oriented problems. For this purpose, a subset of
production systems problems, the Rete match algorithm, has
been chosen.

Section 2 briefly introduces production systems and the
Rete algorithm. Section 3 discusses mapping the Rete
algorithm to data-flow architectures. There, we identify
the problems associated with the Rete algorithm in
a multiprocessor environment and give solutions to these
problems through the allocation and distribution policies
we have developed. In section 4, simulations are carried
out and performance observations obtained for a
data-driven environment are compared to those of a
conventional control-flow approach. Concluding remarks as well
as future research topics are discussed in section 5.

*This material is based upon work supported in part by the
National Science Foundation under Grant No. CCR-8603772 and by the
USC Faculty Research and Innovation Fund.


2. Production systems and the Rete algorithm 

A production system is a program composed entirely of 
conditional statements called productions or rules. The 
left hand side (LHS) is the condition part of a production 
rule, while the right hand side (RHS) is the action part. 
The collection of all the production rules in a produc- 
tion system forms a rule base, called a production mem- 
ory. The productions in the production memory operate 
on a working memory which is a set of assertions called 
Working Memory Elements (WMEs). Both patterns and
WMEs have a list of elements, called Attribute Value Pairs
(AVPs). The value of an attribute can be either fixed (in 
lowercase) or variable (in uppercase). 


Production Memory

Rule 1                      Rule 2
[(a Z) (b Y)]               [(p 1) (q 2) (r X)]
[(c X) (d Y)]               [(c X) (d W)]
[(p 1) (q 2) (r X)]         [(l 5) (m 6) (n W) (o 2)]
→                           →
[Modify (c Y) (d X)]        [Remove 1st pattern]


Working Memory

1: [(p 1) (q 2) (r *)]      4: [(l 5) (m 6) (n 6) (o 2)]
2: [(p 1) (a +) (c =)]      5: [(l 5) (m 7) (n 6) (o @)]
3: [(a +) (b 6)]            6: [(c *) (d 6)]


A typical execution cycle of production systems is composed
basically of three steps: matching, conflict resolution,
and rule firing. In the matching cycle, the
LHSs of all the production rules are matched against the
current WMEs to determine the set of satisfied productions.
The conflict resolution cycle selects one production
if the set of satisfied productions is non-empty. The
actions specified in the RHS of the selected production are
performed in the rule firing cycle. In this paper, we limit
ourselves to the matching step only since it takes most
of the computation time in the evaluation of production
systems.

The Rete match algorithm [Fo82] is one of the best
known approaches used in the matching of objects in
production systems. It constructs a condition dependency
network, saves in memory the information concerning the
changes in the working memory between production cycles,
and then utilizes them at a later time. This is based
on the observation, called temporal redundancy [BF85],
that there is little change in the working memory between
production cycles. The Rete algorithm further reduces the
matching time by sharing identical tests among productions.
This stems from the fact that the productions have
many similar or identical parts, a property called structural
similarity. A condition-dependency network for the two rules
listed above is shown in Figure 1. The network
consists of several types of nodes: root node, one-input
nodes, two-input nodes, negated two-input nodes
and terminal nodes (see [Fo82] for details).


3. Data-flow implementation of the Rete algorithm

In this section, we identify the necessary mapping schemes
to suit the Rete match algorithm and the data-flow
multiprocessor. Bottlenecks in the Rete algorithm are identified
and possible solutions are suggested.


3.1. Suitability of the data-driven execution model
Executing the Rete algorithm on a data-flow multiprocessor
has many advantages over execution on a conventional
control-flow computer. First, the execution principles of
the Rete algorithm are driven by incoming data tokens,
i.e., execution may proceed whenever data are available.
In any situation, multiple firings of actors in data-flow and
of comparison tests in the Rete algorithm are possible.
Second, both are based on the single assignment principle,
i.e., no data modifications except arrays. Third, both a
data-flow machine and the Rete algorithm need dependency
graphs. Fourth, the requirement for the memorization
capability in two-input nodes of the Rete algorithm
assumes a good structure handling technique and
this can be effected by using the I-Structure Controller
in Arvind’s Dynamic Data-flow machine. Finally, the
dynamic data-flow architecture allows an easy manipulation
of the counters (see [Fo82]) since the counter for
negated-pattern processing can be treated the same as other tags
in the dynamic architectures.


3.2. The Rete algorithm in a multiprocessor envi- 
ronment 

Mapping production systems onto multiprocessor systems
has been done in several ways in the recent literature.
Direct mapping, employed by [SM84, St87] for DADO, uses
“full distribution,” which allocates one rule to an available
PE to achieve production-level parallelism. In
[Gu84] a relevancy between the rules and the WMEs is
identified and used to directly allocate rules to PEs. It
has been suggested by [Bi85] that the semantic network
can directly be viewed as a data-flow graph. Each node
in the semantic network corresponds to an active element
capable of accepting, processing, and emitting value tokens
traveling asynchronously along the arcs. The other
approach, suggested by [TM85], may be considered an
indirect mapping. In this approach, all productions are
analyzed and grouped according to the dependency existing
between productions to enable parallel firing of rules.

The mapping scheme adopted for our simulation, however,
is somewhat different from the aforementioned approaches.
The motivation for the choice of an alternative
method lies in two facts. First, the architecture we
have adopted is based on data-flow principles of execution.
Since the parallel model employed in this paper exploits
parallelism at the production level, the condition level,
and further the attribute-value pair level, the mapping scheme
must be efficient enough to utilize all the possible forms of
parallelism inherent to both data-flow principles and the Rete
algorithm.

Second, the Rete algorithm presents two bottlenecks 
which substantially degrade the performance of the pro- 
duction system in our parallel machine: Since the root 
node distributes tokens one at a time to all PEs, tokens 
will pile up on the input arcs as shown in Figure 2. This is 
due to the fact that rules cannot be copied to all PEs. The 
second inefficiency can also be seen on Figure 2. Assume 
that m tokens are received and matched on the left input 
arc of the two-input node. Further assume that a token is 
received and matched on the other input of the two-input 
node. The arrival of this last token will trigger the invoca- 
tion of m comparisons with the values received and stored 
in the left memory LM3 of the two-input node. On the av- 
erage, there will be O(m) such tests. Should the situation 
have been reversed and n tokens be in the right memory 
RMg, a token on the left side would provoke O(n) com- 
parisons. The internal workings of this two-input node are 
therefore purely sequential. In order to avoid the wasted 
time in searching through the entire memory, an effective 
allocation of two-input nodes and one-input nodes should 
be devised. 
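The sequential behavior of a two-input node described above can be made concrete with a small sketch. It is a simplified model, not the simulated data-flow program: tokens are plain dictionaries of variable bindings, the inter-element test is passed in as a function, and the class and method names are hypothetical.

```python
class TwoInputNode:
    """Two-input (join) node with a left and a right token memory."""

    def __init__(self, test):
        self.test = test            # inter-element test: (left, right) -> bool
        self.left, self.right = [], []
        self.comparisons = 0        # total number of tests performed

    def _arrive(self, token, own, other, as_left):
        own.append(token)
        matches = []
        for stored in other:        # sequential scan of the opposite memory
            self.comparisons += 1
            pair = (token, stored) if as_left else (stored, token)
            if self.test(*pair):
                matches.append(pair)
        return matches

    def add_left(self, token):
        return self._arrive(token, self.left, self.right, True)

    def add_right(self, token):
        return self._arrive(token, self.right, self.left, False)
```

With m tokens already stored in the left memory, a single call to `add_right` performs m comparisons, which is the O(m) cost noted in the text.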


3.3. Allocation of productions 

The allocation policy we are going to use does not follow 
the structural similarity discussed in section 2. Those con- 
dition patterns that are shared by different productions 
are copied and allocated to different PEs. It is based on 
the fact that by copying shared patterns and allocating to 
different PEs the overhead in inter-processor communica- 
tion can be substantially reduced. However, this policy 
will consume a lot of processor space and be costly as 


the number of productions that share patterns or part of 
patterns increases. 

Suppose that n PEs are available. They are logically
partitioned into √n groups, where each group has √n
PEs. Those condition patterns that have i AVPs in each
pattern are allocated to PEs in Group i. Each two-input
node is split into two memories: left and right memories.
Memories are allocated to the PEs where the corresponding
one-input nodes are allocated. Those memories that have
no corresponding one-input nodes are allocated to PEs
in Group 0. Allocating a memory to a PE will ensure an
even distribution of processing load across the processor
space. At the same time, we can realize parallel matching
at the condition level. Terminal nodes are not explicitly
allocated to PEs for our simulation.

Based on the above allocation policy, the network is al- 
located to PEs, shown in Figure 3. PEs are partitioned 
into 5 different groups, where Group 1 is not used in our 
example since no condition pattern has only one AVP. 
Consider the first pattern of Rule 2, [(p 1) (q 2) (r X)], 
for example. The sequence of nodes in the pattern and 
the left memory for that pattern are labeled 9 through 
12 in Figure 1 (9 through 11 are one-input nodes). Since 
the pattern has 3 AVPs, it is classified into Group 3 and 
allocated to PE, of Group 3, designated PE3,. The sec- 
ond pattern of the Rule 2 has 2 AVPs and right memory, 
labeled 13 through 15. It is classified into Group 2 and 
allocated to PE, of Group 2, designated PE,». In gen- 
eral, the number of PEs needed to allocate productions is 
proportional to the number of inter-element feature tests 
in the productions. 
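The grouping rule of this allocation policy can be sketched as below. The sketch covers only the assignment of condition patterns to groups by AVP count; the within-group index order, the pattern labels, and the function name are assumptions, and the Group 0 memories are not modeled.

```python
from collections import defaultdict


def allocate(patterns):
    """Map each condition pattern to (group, index within group).

    `patterns` is a list of (label, avps) pairs; the group number is
    the number of AVPs in the pattern. Shared patterns are expected
    to appear once per rule, i.e. they are copied rather than shared.
    """
    next_free = defaultdict(int)
    placement = {}
    for label, avps in patterns:
        group = len(avps)
        placement[label] = (group, next_free[group])
        next_free[group] += 1
    return placement


# The first two patterns of Rule 2 (labels are hypothetical).
rule2 = [
    ("R2.1", [("p", 1), ("q", 2), ("r", "X")]),
    ("R2.2", [("c", "X"), ("d", "W")]),
]
placement = allocate(rule2)
```

Here the three-AVP pattern lands in Group 3 and the two-AVP pattern in Group 2, as in the example above; the indices within each group depend on the order in which patterns are presented.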


3.4. Dynamic WMEs distribution 

In order to overcome the bottleneck at the root node
we propose a scheme which simultaneously distributes
many different tokens to many PEs at a time if many
WMEs are available at the same time for distribution.
WMEs that have i AVPs never match patterns that have
j AVPs such that i < j. The network shown in Figure 1
is, therefore, modified to a network with multiple
root nodes, as depicted in Figure 4. Whenever the new
WMEs that are generated due to the rule firings become
ready for distribution, PEs distribute WMEs based on the
group numbers attached to the WMEs.

Assume that the two rules are compiled and allocated
to the PEs according to the allocation policy described
in section 3.3. Suppose further that the set of WMEs
shown in section 2 is available and is about to be
distributed into the network. If the Rete algorithm
distributes one WME at a time to the network through
the root node in Figure 1, it would take 6 time units
to distribute them. Furthermore, the number of comparison
tests performed at the very first one-input
nodes (1, 4, 9, 13, and 16) will reach 36 (= 6 PEs x
6 WMEs). This is depicted in Figure 5(a), where one
WME at a time is sequentially distributed to all PEs. For
example, when WME 1 is distributed, all 6 PEs to which
patterns are allocated make a comparison test simultaneously.
Only two PEs, PE3,0 and PE3,1, will succeed in
matching. This forces the machine to operate in
Single-Instruction-stream-Multiple-Data-stream (SIMD) execution
mode although it has a
Multiple-Instruction-stream-Multiple-Data-stream (MIMD) processing capability.
Applying our distribution policy, the 6 WMEs are partitioned
into 3 groups and the group numbers are assigned
to WMEs. WMEs 3 and 6 get group #2 while 1 and 2 get
#3 and 4 and 5 get #4. The total number of comparison
tests performed at the very first one-input nodes in three
sequences reduces to 12 (= 2x2 + 3x2 + 1x2), as shown
in Figure 6(b). There are three bins in Figure 6(b), where
each bin corresponds to a group. In each group, WMEs
are sequentially distributed to PEs belonging to the
corresponding group in the PE space. However, between groups
WMEs are simultaneously distributed. The speed-up for
the distribution policy would then be 36/12 = 3 for the
given set of WMEs. The number of groups in WMEs
determines the speed-up. In the worst case, only one WME
can be distributed to all PEs at a time as shown in
Figure 6(a). Note that in the original Rete algorithm, a
sequential distribution, analogous to our worst case, would
be implemented. Our improvement instead provides
extra parallelism, although this scheme depends heavily on
the assumption that WMEs will be evenly classified into all groups.
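The comparison counts quoted above (36 versus 12) can be reproduced with a short calculation. The group membership of the WMEs is taken from the text; the number of pattern-holding PEs in each group (3, 2 and 1) is read off the example rules and is an assumption, since the text gives only the products 2x2 + 3x2 + 1x2. Function names are illustrative.

```python
from collections import Counter


def sequential_tests(pes_per_group, wme_sizes):
    """Every WME is broadcast to every PE through the single root node."""
    return sum(pes_per_group.values()) * len(wme_sizes)


def grouped_tests(pes_per_group, wme_sizes):
    """Each WME goes only to the PEs of the group matching its AVP count."""
    wmes_per_group = Counter(wme_sizes)
    return sum(pes_per_group.get(group, 0) * count
               for group, count in wmes_per_group.items())


pes_per_group = {2: 3, 3: 2, 4: 1}   # assumed split of the 6 pattern PEs
wme_sizes = [3, 3, 2, 4, 4, 2]       # AVP counts of WMEs 1..6

seq = sequential_tests(pes_per_group, wme_sizes)
grp = grouped_tests(pes_per_group, wme_sizes)
speed_up = seq / grp
```

The total of 12 grouped tests does not depend on which group is assumed to hold two PEs and which three, because each group receives exactly two WMEs in this example.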


4. Simulation and performance evaluation 

4.1. Simulation 

In this simulation, various WMEs and rules are used.
First, one-input nodes and array operations are tested by
1 PE. Simulation results show that a sequence of
one-input nodes takes about 15 simulation time units. Each
additional matching takes 13 time units. Second, the three
conditions of Rule 2 are tested separately, one at a
time, with various WMEs. Simulation results indicate that
WME 1 matches against WME 6 of RM15 in 76 time units.
Third, two patterns are executed in parallel by two PEs
and take about 200-500 time units to match WMEs,
depending upon the number of WMEs that have reached
either LM12 or RM15.


4.2. Performance evaluation 

The following assumptions are made in the simulation:
No tokens wait for their partner for more than 1 time
unit. The routing time for a token to reach any PE is
set to 1 time unit. Each PE can execute 10 comparison
tests at a time. On the average, there are 3 patterns per
rule. One simulation time unit is set to 1 μsec. With
the simulation results and assumptions listed above we
identify the following results:


1. To, the time units for a PE to process one-input nodes
and variable bindings with 1 WME, < 20.

2. Ti, the time units for a PE to process a two-input
node with one WME, < 100.

3. Ty, the time units for 2 PEs to process a two-input
node with various WMEs, < 125.

4. T,, the time units for 2 PEs to process a negated
two-input node, < 300.

5. T,, the time units to instantiate a rule that has 1
regular and 1 negated two-input nodes, = T, + T,, =
400.


Suppose that a certain production system has rules 
with average number of two inter-element features (1 
two-input node and 1 negated two-input node) per rule 
and that there is only one WME matched through the 
one-input nodes and stored in each memory. The data- 
flow model would instantiate a rule in 400 simulation 
time units, which is equivalent to 0.4 msec. If there are 
more than 1 WME inatched through one-input nodes and 
stored in each memory, 7, the time taken to instanti- 
ate a rule will be proportional to the number of WMEs 
stored in each memory, as verified by our simulation re- 
sults. When there are on the average n WMEs in each 
memory, 7, ~ 400n = 0.4n msec in the absence of conflict 
resolution. 

When the conflict resolution step (10% of total computation
time [Fo82]) is taken into account, $T_r = 0.4(1 + 10/90)n \approx 0.5n$
msec, where n is the average number
of WMEs stored in a memory. This $T_r$ in turn gives
1000/0.5n = 2000/n rule firings/second. Compared to
the analysis of the implementation of OPS5 on DADO
[Gu84], the choice of a data-flow multiprocessor gives a
2000/100n = 20/n fold speed-up, since DADO is estimated
to be able to fire below 100 rules/second.
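
The arithmetic behind these estimates can be checked with a short sketch. It is written in Python purely for illustration; the function names and default arguments are assumptions introduced here, and the 100 rules/second figure is the upper bound quoted above for DADO.

```python
def rule_time_msec(n, match_msec_per_wme=0.4, conflict_share=0.10):
    """Time to instantiate one rule when each memory holds n WMEs.

    The match phase costs 0.4 msec per stored WME; conflict resolution is
    taken as 10% of the total time, i.e. 10/90 of the match time.
    """
    match = match_msec_per_wme * n
    return match * (1 + conflict_share / (1 - conflict_share))


def firings_per_second(n, msec_per_wme=0.5):
    """Rule firings per second, using the rounded 0.5n msec figure."""
    return 1000.0 / (msec_per_wme * n)


def speedup_over_dado(n, dado_rules_per_sec=100.0):
    """Speed-up relative to a machine firing 100 rules/second."""
    return firings_per_second(n) / dado_rules_per_sec


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, round(rule_time_msec(n), 3),
              firings_per_second(n), speedup_over_dado(n))
```

The unrounded value of 0.4(1 + 10/90) is about 0.44 msec per WME, so the 2000/n rate obtained from the rounded 0.5n msec figure is slightly conservative.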




5. Conclusion 

In this paper, we have explored the potential of data-flow
multiprocessor systems for the efficient implementation of
symbolic computations. Among the various data-flow
architectures proposed, the MIT Tagged Token Data-flow
Machine has been chosen for our simulation model. As
a benchmark of symbolic computations, the Rete match
algorithm has been chosen.

Inefficiencies in the implementation of the Rete algo- 
rithm on parallel machines have been identified and pos- 
sible solutions to the problems have been worked out in 
our data-flow environment. Simultaneous distribution of 
many WMEs to many PEs has proven effective in deliv- 
ering the parallelism inherent to the Rete algorithm and 
allowed by a given configuration of our data-flow architec- 
ture. Allocating conditions to different PEs, we have com- 
pletely distributed O(n) iterations throughout the system. 

The Rete algorithm has been successfully implemented
in a data-flow processing environment. The results we
obtained reveal that symbolic computations on a data-flow
multiprocessor computer can indeed be processed
efficiently. Comparison with conventional computers has
shown that a high speed-up could be obtained from this
approach. However, some problems in applying data-flow
principles of execution remain unsolved. One of the
problems is programmability in a high-level language. Also,
a complete implementation of the conflict resolution step
will be undertaken next. In conclusion, it appears that the
data-flow principles of execution are not limited to
numerical processing but will also find applications in some AI
problems.
Abstract


An important performance characteristic of a parallel 
processor is its ability to implement data permutations; this is 
especially true for massively parallel processors which have 
restricted interconnection networks. Efficient programming of 
a massively parallel processor requires a non-conventional
programming language. The overhead incurred when using a 
high-level programming language is also an important 
performance issue. The performance of a number of 
fundamental algorithms which have been implemented on 
NASA’s Massively Parallel Processor is presented and the data 
permutation capability of the MPP is examined. These 
algorithms include: data permutations, the FFT, convolution, 
and arbitrary data mappings. The MPP is programmed in the 
high level language Parallel Pascal and the impact of using a 
simple implementation of this language is estimated. 


1. Introduction 


In order to analyze the performance of a massively 
parallel processor system it is necessary to characterize its data 
permutation ability. In this paper a characterization scheme is 
proposed which involves measuring the ability to perform a set 
of regular permutations. These permutations occur in many 
scientific problems and knowledge of their performance may 
also be useful in guiding a programmer to develop efficient 
programs. The performance of these permutations on NASA’s 
Massively Parallel Processor (MPP) is presented and the 
overhead due to their implementation in a high level language 
is also given.


Massively parallel systems are suitable for the large class
of scientific applications which involve regular operations on
large data arrays. The main interest is in evaluating the
system’s performance for these well matched applications.
Many of these applications involve matrix operations such as
the Fast Fourier Transform (FFT) and matrix convolution; the
performance of both of these operations on the MPP is
considered in detail.


In the remainder of this section the MPP is described and 
the performance of its primitive operations is presented. In 
section two the performance of the MPP for a number of 
important data permutations is considered in detail. The 
performance of the MPP for matrix convolution is considered 
in section three and the FFT is considered in section four. The 
convolution operation can be analyzed by means of the 
primitives presented in section one while analysis of the FFT 
requires knowledge of the data permutations presented in 
section two. Finally, in section five, a heuristic data mapping
algorithm is considered which is data dependent and requires
information to be acquired from the processor array in order
to determine the instruction sequence.


1.1. The Massively Parallel Processor 


The Massively Parallel Processor consists of 16384 bit-serial
Processing Elements (PE’s) connected in a 128 x 128 mesh
[1]. That is, each PE is connected to its 4 adjacent neighbors
in a planar matrix. The two dimensional grid is one of the 
simplest interconnection topologies to implement, since the 
PE’s themselves are set out in a planar grid fashion and all 
interconnections are between adjacent components. 


The PE’s are bit-serial, i.e. the data paths are all one bit
wide. This organization offers the maximum flexibility, at the 
expense of the highest degree of parallelism, with the minimum 
number of control lines. The minimal architecture of the MPP 
is of particular interest to study, since any architecture 
modifications to improve performance would result in a more 
complex PE or a more dense interconnection strategy. 


The MPP Processing Element 


The MPP processing element is shown in Figure 1. All 
data paths are one bit wide and there are 8 PE’s on a single 
CMOS chip with the local memory on external memory chips. 
Except for the shift register, the design is essentially a minimal 
architecture of this type. The single bit full adder is used for 
arithmetic operations and the Boolean processor, which 
implements all 16 possible two input logical functions, is used 
for all other operations. The NN select unit is the interface to 
the interprocessor network and is used to select a value from 
one of the four adjacent PE’s in the mesh. 


The S register is used for I/O. A bitplane is slid into the S 
registers independent of the PE processing operation and it is 
then loaded into the local memory by cycle stealing one cycle. 
The G register is used in masked operations; when masking is 
enabled only PE’s in which the G register is set perform any 
operations. Not shown in Figure 1 is an OR bus output from
the PE. All these outputs are connected (ORed) together so 
that the control unit can determine if any bits are set in a 
bitplane in a single instruction. On the MPP the local memory 
has 1024 words (bits) and is implemented with bipolar chips 
which have a 35 ns access time. The clock cycle time is 100 ns 
which is sufficient for a memory access and an ALU operation. 


The main novel feature of the MPP PE architecture is 
the reconfigurable shift register. It speeds up integer 
multiplication by a factor of two and also has an important 
effect on floating-point performance.

Figure 1. The MPP Processing Element


Array Edge Connections 


The interprocessor connections at the edge of the 
processor array may either be connected to zero or to the 
opposite edge of the array. With the latter option rotation 
permutations can be directly implemented. A third option is 
to connect the opposite horizontal edges displaced by one bit 
position. With this option the array is connected in a spiral by 
the horizontal connections and can be treated like a
one-dimensional vector of 16384 elements.


The MPP Control Unit 


A number of processors are used to control the MPP
processor array; their organization is shown in Figure 2. The
concept is to always provide the array with data and
instructions on every clock cycle. The host computer is a VAX
11/780; this is the most convenient level for the user to
interact with the system, since it provides a conventional
environment with direct connection to terminals and other
standard peripherals. The user usually controls the MPP by
developing a complete subroutine which is downloaded from the
VAX to the main control unit (MCU) where it is executed. The
MCU is a high speed 16-bit minicomputer which has direct
access to the microprogrammed processor array control unit
(PCU). It communicates with the PCU by means of macro
instructions of the form "add array A to array B". The PCU
contains runtime microcode to implement such operations
without missing any clock cycles. A first-in first-out (FIFO)
buffer is used to connect the MCU to the PCU so that the next
macro operation generation in the MCU can be overlapped with
the execution in the PCU. A separate I/O control unit (IOCU) is
used to control input and output operations to the processor
array. It controls the swapping of bitplanes between the
processor array and the staging memory independent of the
array processing activity. Processing is only halted for one
cycle in order to load or store a bitplane.

[Figure 2 block diagram: Host Computer (VAX 11/780), Staging
Memory (32 Mb), Data Bus, Main Control Unit (MCU), I/O Control
Unit (IOCU), Processor Array Control Unit (PCU), Processor
Array (128 x 128 PE’s)]

Figure 2. The System Organization of the MPP


The staging memory is a large data store which is used as 
a data interface between peripheral devices and the processor 
array; it provides two main functions. First, it performs
efficient data format conversion between the data element
stream, which is most commonly used for storing array data, and
the bitplane format used by the MPP. Second, it provides
space to store large data structures which are too large for the 
processor array local memory. 
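
The format conversion can be pictured with a small sketch. It is an illustration only, written in Python; the function names are hypothetical and the least-significant-bit-first plane order is an assumption, not a documented property of the staging memory.

```python
def to_bitplanes(values, bits):
    """Split a stream of unsigned integers into `bits` bitplanes.

    Plane b holds bit b of every element (plane 0 = least significant bit).
    """
    return [[(v >> b) & 1 for v in values] for b in range(bits)]


def from_bitplanes(planes):
    """Reassemble the element stream from its bitplanes."""
    count = len(planes[0])
    return [sum(planes[b][i] << b for b in range(len(planes)))
            for i in range(count)]


if __name__ == "__main__":
    stream = [5, 0, 255, 18]
    planes = to_bitplanes(stream, 8)
    print(planes[0])                # least significant bit of each element
    print(from_bitplanes(planes))   # round trip back to the element stream
```

On the MPP each bitplane would hold one bit for each of the 16384 PE’s; the four-element stream here only keeps the example readable.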


1.2. Parallel Pascal 


Parallel Pascal is an extended version of the Pascal 
programming language which is designed for the convenient 
and efficient programming of parallel computers. It is the first 
high level programming language to be implemented on the 
MPP. Parallel Pascal was designed with the MPP as the 
initial target architecture; however, it is also suitable for a 
large range of other parallel processors. A more detailed 
discussion of the language design is given in [2].


In Parallel Pascal all conventional expressions are
extended to array data types. There are three fundamental
classes of operations on array data which are primitives on
processor arrays but which are not available in conventional
programming languages: data reduction, data
permutation and data broadcast. These operations have been
included as primitives in Parallel Pascal. Mechanisms for the
selection of subarrays and for selective operations on a subset
of elements are also important language features.


The Parallel Pascal compiler generates a parallel P-code
[3]. A code generator has been developed for NASA which
generates MCU assembly language for procedures which are to
directly use the MPP. Runtime support is provided in both
the MCU and the PCU. No code for the PCU is directly
generated by the compiler. Also, the code generator does not
have a conventional optimization stage.


1.3. Performance Measurements 


The execution times of several primitive operations
implemented in Parallel Pascal were measured using program
loops and timing routines. For example, to measure the time
for a parallel array multiplication the following program
segments were timed:


var a,b: Parallel array [1..128,1..128] of real;


Program segment #1:
For i := 1 to 1000 do
begin
  a := b
end;


Program segment #2:
For i := 1 to 1000 do
begin
  a := a*b
end;


The time required to execute the first program segment was
6.60 msec, and the time to execute the second program
segment was 87.7 msec. The difference between these time
measurements is due to the 1000 array multiply operations.
Therefore, the time required for a parallel multiplication is
given by:

$t_{mul} = \frac{t_{s2} - t_{s1}}{1000}$

where $t_{mul}$ is the time for a real multiplication, and $t_{s1}$ and
$t_{s2}$ are the measured times for program segment #1 and
program segment #2 respectively.
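
As a worked instance of this differencing procedure, the sketch below plugs in the two measured loop times quoted above. It is written in Python for illustration; the function name is an assumption introduced here.

```python
def per_operation_usec(t_baseline_msec, t_with_op_msec, iterations=1000):
    """Time of one array operation in microseconds, by loop differencing.

    t_baseline_msec: measured time of the loop without the operation
    t_with_op_msec:  measured time of the loop including the operation
    """
    return (t_with_op_msec - t_baseline_msec) * 1000.0 / iterations


if __name__ == "__main__":
    # 6.60 msec and 87.7 msec are the two measured loop times
    print(round(per_operation_usec(6.60, 87.7), 1))
```

With these inputs the result is about 81.1 µsec per parallel real multiplication.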


1.4. MPP Primitive Operations 


The costs of the basic primitive parallel operations of the
MPP, when programmed in Parallel Pascal, were measured
using the procedure outlined above. When calculating the
execution times of operations on 8—bit integers or Boolean data 
types, the time to execute program segment #1 is equal to 6.48 
msec. The measured operation costs are presented in Table 1. 
Optimal times for these operations were estimated for the 
processor array by itself; these are also presented in Table 1. 
Optimal floating point arithmetic times were obtained from 
[1], and the remaining optimal times were derived by counting 
the clock cycles for optimal microcode instruction sequences 
applied to the PE array hardware. The difference between the 
measured and optimal times is due to three main factors: (a) 
MCU overhead, (b) overhead introduced by the Parallel Pascal 
compiler, and (c) overhead in the PCU microcode. 


From Table 1 we can see that Boolean operations are the
least efficiently implemented. The MCU adds an overhead of 3
or more µsec per operation which, in the case of Boolean
operations, dominates the execution times and causes an order
of magnitude loss of performance. On average, the Boolean
measured times are about 20 times longer than the
corresponding optimal times. For floating point operations,
the 3 or more µsec overhead is negligible since the execution
times of the operations are on the order of tens or hundreds of
µsec. The floating-point add operation required about twice
the stated optimal time. We do not know the reason for this;
one possibility is that the run time implementation is
significantly different from that used by Batcher in [1]. It is
important to note that the implementation of floating-point
operations is programmable since the PE’s are bit serial. For
example, Batcher used the IBM format while the current MPP
run-time system implements the VAX floating-point format.
Finally, it was noted that shift operations are least efficient
when either the number of bits to be shifted is small or the
shift distance is small.


Table 1. Optimal and Measured execution times of some typical
operations.

Operation
assignment r
assignment i8
assignment b
add r
mult r
mult r x s
sin r
add i8
mult i8
mult i8 x s
div i8
mod i8
trunc
round
and
or
not
odd(i8)
any
min(r)
max(r)
min(i8)
max(i8)
compare i8
where b
shift(r,0,1)
shift(r,0,64)
shift(r,64,64)
shift(i8,0,1)
shift(i8,0,64)
shift(i8,64,64)
shift(b,0,1)
shift(b,0,64)
shift(b,64,64)
procedure call

Times are in µsec.

s = scalar, r = array of real, i8 = array of 8-bit integer, b =
array of Boolean.

(*) The measured time for a procedure call varies according to
the variables passed on the call.


1.5. Optimal Performance Estimation 


In the following, when an algorithm is presented, the 
corresponding measured and estimated execution times will be 
given. The measured times were obtained using the timing 
functions of the MPP as previously discussed for primitive 
operations. The estimated optimal times were calculated by 
tracing the Parallel Pascal code of the algorithm and adding 
the optimal execution times, given in Table 1, of the 
encountered array instructions. Any scalar operations,
executed by the MCU, were assumed to be concurrent with the
execution of the array operations, and thus were not taken
into account. As an example, the following is a program
segment of the shuffle permutation, for which the calculation 
of the estimated execution time is shown. 


x := 1; y := 0;                                      -1-
tmx1 := mx;                                          -2-
tmx2 := mx;                                          -3-
num := 2;                                            -4-
while num < tn do                                    -5-
begin
  tmx1 := shift(tmx1, -x, -y);   (* shift down *)    -6-
  where id = num do                                  -7-
    mx := tmx1;                                      -8-
  num := num + 2;                                    -9-
end;


where x and y are integers, and num and tn are 8-bit integers
(tn = 128); id is a parallel array of 8-bit integers; mx, tmx1,
and tmx2 are parallel arrays of either 32-bit reals, 8-bit
integers, or Booleans. Let the number of bits in the data
elements be $N_b$.


The optimal execution time is equal to the sum of:
• 2 assign array of $N_b$ = 2 x (0.2 x $N_b$) (-2- and -3-)
the following are executed tn/2 - 1 (i.e. 63) times:
• shift array of $N_b$ by 1 = 0.3 x $N_b$ (-6-)
• where b + compare i8 = 0.1 + 2.5 (-7-)
• assign array of $N_b$ = 0.2 x $N_b$ (-8-)


Note that an assignment operation takes two cycles per bit,
and a shift operation takes per bit two cycles (for load and
store) plus the shift distance. The total optimal time is equal
to 31.9 x $N_b$ + 163.8 µsec. For example, for arrays of 8-bit
integer where $N_b$ = 8, the optimal time is 419.1 µsec.
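
The tally above can be reproduced mechanically. The following Python sketch is an illustration only: the function and constant names are assumptions introduced here, and the per-operation costs are the optimal values used in the list above (0.2 µsec per bit for an assignment, 0.1 µsec per bit per cycle for a shift, 0.1 µsec for `where b`, 2.5 µsec for `compare i8`).

```python
WHERE_USEC = 0.1        # optimal cost of "where b"
COMPARE_I8_USEC = 2.5   # optimal cost of "compare i8"


def assign_usec(bits):
    """Array assignment: two 100 ns cycles per bit."""
    return 0.2 * bits


def shift_usec(bits, distance):
    """Array shift: per bit, two cycles (load, store) plus the distance."""
    return (2 + distance) * 0.1 * bits


def shuffle_segment_optimal_usec(bits, tn=128):
    """Optimal time of the program segment for `bits`-wide elements."""
    iterations = tn // 2 - 1                      # 63 loop passes for tn = 128
    setup = 2 * assign_usec(bits)                 # statements -2- and -3-
    per_pass = (shift_usec(bits, 1)               # statement -6-
                + WHERE_USEC + COMPARE_I8_USEC    # statement -7-
                + assign_usec(bits))              # statement -8-
    return setup + iterations * per_pass


if __name__ == "__main__":
    for bits in (1, 8, 32):
        print(bits, round(shuffle_segment_optimal_usec(bits), 1))
```

The result agrees with the closed form 31.9 x $N_b$ + 163.8 and comes to roughly 419 µsec for 8-bit integers.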


1.6. The Transfer Ratio 


A comparative measure, the transfer ratio [4], is used to
express the cost of an algorithm. The transfer ratio is defined
as the ratio of the time for the data transfer to the time for
an elemental operation. The time for an elemental operation is
defined as the average of the time of a multiplication and
the time of an addition on the processor array. Based on the
optimal execution times given in Table 1, the time for an
elemental operation for 32-bit floating point elements is 57
µsec, and for 8-bit integer elements it is 5.65 µsec. For
Boolean elements, the time for an elemental operation is taken
to be two clock cycles, i.e. 0.2 µsec [4].
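
A minimal sketch of the definition, in Python for illustration. The dictionary keys and the function name are labels chosen here, not names from the MPP software; the three elemental times are the ones stated above.

```python
# Elemental operation times in microseconds, as given in the text
ELEMENTAL_USEC = {"real32": 57.0, "int8": 5.65, "bool": 0.2}


def transfer_ratio(transfer_usec, dtype):
    """Data transfer time expressed in elemental operations of `dtype`."""
    return transfer_usec / ELEMENTAL_USEC[dtype]


if __name__ == "__main__":
    # The same 114 microsecond transfer looks very different per data type
    for dtype in ("real32", "int8", "bool"):
        print(dtype, round(transfer_ratio(114.0, dtype), 1))
```

A transfer of fixed duration is therefore relatively cheap for floating point data and relatively expensive for Boolean data.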


2. Data Permutations 


The performance of the MPP for a number of classical 
regular data permutations is considered here; however, the 
method presented is not limited to these permutations. A
permutation function is performed on an ordered set of N
elements, and it is defined by a one-to-one function $\pi(x)$ [5].
Both x and $\pi(x)$ are integers between 0 and N-1; x and $\pi(x)$
represent the addresses of the elements before and after the
permutation, respectively.


2.1. The Shift Permutation 


The near-neighbor interconnection network of the MPP
can only directly implement the shift permutation. This
permutation is possible because the PE array has, in addition
to the near neighbor connections, toroidal end-around edge
connections. Any other permutation can only be achieved
through the shift permutation.


In Parallel Pascal, the shift permutation is specified with
the built-in function ’rotate’. Its arguments consist of the array
to be shifted, and the amount and direction of the shift. For
the MPP, the cost of a shift permutation ($t_s$) depends on the
amount of the shift (d) and on the number of bits of the array
elements ($N_b$). The optimal cost for the shift permutation is
given by


$t_s = (2 + d) \times 0.1 \times N_b$ µsec


2.2. The Test Permutations 


The permutations which have been implemented for the
MPP are exchange ($\varepsilon$), shuffle ($\sigma$), butterfly ($\beta$), and bit reversal
($\rho$). Where applicable, the respective Sub and Super
permutations were also implemented. Conceptually, a Sub 
permutation involves a partitioning of the vector into groups 
of adjacent elements such that each group is mapped into itself 
only. A Super permutation involves partitioning the vector 
into groups of adjacent elements such that each group moves 
the same distance in the permutation. In general, the 
implementation of a sub or super permutation involves less 
work on the MPP than for the regular form of the 
permutation. 


A permutation will be defined by considering the binary
representation of x:


$x = (b_n, b_{n-1}, \ldots, b_1)$


$x = b_n 2^{n-1} + b_{n-1} 2^{n-2} + \ldots + b_1$


The above expressions represent the binary address of an
element in a set of $N = 2^n$ elements, and a permutation is
defined by the permutation on the bits of this address [5].


The Exchange Permutation

$\varepsilon_{(k)}(x) = (b_n, \ldots, \bar{b}_k, \ldots, b_1)$ where $1 \le k \le n$


The exchange permutation consists of complementing bit
k of the input address. Thus, this permutation consists of
exchanging every pair of elements, where two elements form a
pair if their addresses are the same except for the kth bit.


In the exchange permutation all the elements move an
equal distance, except that half of the elements move in one
direction and the other half move in the opposite direction.
This procedure is accomplished in two steps:


1 - Calculate the amount of the shift, which is equal to $2^{k-1}$.


2 - Perform the exchange of pairs: where $b_k = 1$ then shift
up (or left), and where $b_k = 0$ then shift down (or right).
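
The two steps can be sketched as follows. This is a Python illustration on a one-dimensional vector, not the Parallel Pascal program; `rotate` is a stand-in for the toroidal shift, "down" is taken to mean toward higher addresses, and the masks are evaluated on the destination address. All three conventions are assumptions made for the example.

```python
def rotate(v, d):
    """Toroidal shift: the element at index i moves to (i + d) mod N."""
    n = len(v)
    return [v[(i - d) % n] for i in range(n)]


def exchange(v, k):
    """Exchange permutation: complement bit k (1-based, bit 1 = LSB)."""
    d = 1 << (k - 1)              # step 1: shift amount 2^(k-1)
    moved_down = rotate(v, d)     # every element moved to address + d
    moved_up = rotate(v, -d)      # every element moved to address - d
    result = []
    for addr in range(len(v)):    # step 2: masked selection
        if (addr >> (k - 1)) & 1:
            # destination has b_k = 1, so its element came from addr - d
            result.append(moved_down[addr])
        else:
            # destination has b_k = 0, so its element came from addr + d
            result.append(moved_up[addr])
    return result


if __name__ == "__main__":
    print(exchange(list(range(8)), 1))   # swaps adjacent elements
    print(exchange(list(range(8)), 3))   # swaps the two halves
```

Only two shifts of distance $2^{k-1}$ are needed regardless of the vector length, which is why the exchange is among the cheaper permutations on a mesh.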


The Shuffle Permutation

$\sigma(x) = (b_{n-1}, b_{n-2}, \ldots, b_1, b_n)$


The shuffle permutation consists of a circular left shift of
the bits of the input address. The resulting permutation
consists of splitting the set of N elements in half, and then
interleaving the halves as in a perfect card shuffle.


The algorithm for the shuffle permutation is as follows:


1 - Map the upper half of the input matrix by shifting the
elements down a total of $N/2 - 1$ steps. For each shift
down, one element will be located in the correct position,
and is therefore stored in the result array.


2 - Map the lower half of the input matrix. The same
procedure as in step 1 is followed except that the elements
are shifted up.

The Sub-Shuffle permutation is specified by:


$\sigma_{(k)}(x) = (b_n, \ldots, b_{k+1}, b_{k-1}, \ldots, b_1, b_k)$


In the sub-shuffle permutation, the set of elements is
divided into $2^{n-k}$ groups, each one of size $2^k$, and a perfect
shuffle is performed on each of the subgroups. The sub-shuffle
algorithm is the same as for the shuffle, except that there are
$2^{n-k}$ halves, and the total amount of the shift is $2^{k-1} - 1$.


The Super-Shuffle permutation is specified by:

$\sigma^{(k)}(x) = (b_{n-1}, \ldots, b_{n-k+1}, b_n, b_{n-k}, \ldots, b_1)$


The super-shuffle permutation performs a perfect shuffle on
the whole set, except that now an ’element’ consists of a group
of $2^{n-k}$ elements. The super-shuffle algorithm is similar to the
shuffle algorithm, except that instead of shifting by one, a shift
by $2^{n-k}$ is performed at each step.
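
The three shuffle variants can be illustrated on addresses directly. The Python sketch below is an assumption-laden illustration rather than the Parallel Pascal code: the function names are invented here, bit 1 is taken as the least significant bit, and `shuffle_by_shifts` mimics the shift-and-store algorithm on a plain list.

```python
def rotl_bits(x, width):
    """Circular left shift of the `width` least significant bits of x."""
    mask = (1 << width) - 1
    low = x & mask
    rotated = ((low << 1) | (low >> (width - 1))) & mask
    return (x & ~mask) | rotated


def shuffle_addr(x, n):
    """Shuffle: rotate all n address bits."""
    return rotl_bits(x, n)


def sub_shuffle_addr(x, k):
    """Sub-shuffle: rotate only the k least significant bits."""
    return rotl_bits(x, k)


def super_shuffle_addr(x, n, k):
    """Super-shuffle: rotate only the k most significant of n bits."""
    low_width = n - k
    high = rotl_bits(x >> low_width, k)
    return (high << low_width) | (x & ((1 << low_width) - 1))


def permute(v, addr_map):
    """Move the element at address x to address addr_map(x)."""
    out = [None] * len(v)
    for x, item in enumerate(v):
        out[addr_map(x)] = item
    return out


def shuffle_by_shifts(v):
    """Perfect shuffle using N/2 - 1 unit shifts down, then up."""
    n, half = len(v), len(v) // 2
    result = list(v)                      # first and last element stay put
    tmp = list(v)
    for step in range(1, half):           # upper half moves down
        tmp = [tmp[-1]] + tmp[:-1]        # shift down by one position
        result[2 * step] = tmp[2 * step]  # one element is now in place
    tmp = list(v)
    for step in range(1, half):           # lower half moves up
        tmp = tmp[1:] + [tmp[0]]          # shift up by one position
        dest = n - 1 - 2 * step
        result[dest] = tmp[dest]
    return result


if __name__ == "__main__":
    v = list(range(8))
    print(permute(v, lambda x: shuffle_addr(x, 3)))
    print(shuffle_by_shifts(v))
    print(permute(v, lambda x: sub_shuffle_addr(x, 2)))
    print(permute(v, lambda x: super_shuffle_addr(x, 3, 2)))
```

The shift-based version needs a number of unit shifts proportional to N, which is consistent with the shuffle being one of the more expensive permutations on a near-neighbor network.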


The Butterfly Permutation

$\beta(x) = (b_1, b_{n-1}, \ldots, b_2, b_n)$


The butterfly permutation consists of exchanging the most
significant bit (MSB) and the least significant bit (LSB). Three
cases arise from this exchange: First, if the bits are equal (i.e.
both are equal to 1 or to 0), the permuted addresses are
unchanged, and therefore the corresponding elements remain in
their initial positions. Second, if MSB = 1 and LSB = 0, the
corresponding elements have to move up $2^{n-1} - 1$ locations.
Third, if MSB = 0 and LSB = 1, the corresponding
elements have to move down $2^{n-1} - 1$ locations.


Similarly to the butterfly, the sub-butterfly permutation
consists of exchanging bit k (MSB) and bit 1 (LSB), and the
super-butterfly permutation consists of exchanging bit n (MSB)
and bit n-k+1 (LSB). The binary representations of these two
permutations are:


$\beta_{(k)}(x) = (b_n, \ldots, b_{k+1}, b_1, b_{k-1}, \ldots, b_2, b_k)$


$\beta^{(k)}(x) = (b_{n-k+1}, b_{n-1}, \ldots, b_{n-k+2}, b_n, b_{n-k}, \ldots, b_1)$


The butterfly algorithm consists of two steps:


1 - Generate two Boolean masks: ’shiftup’ indicates the
positions where the MSB of the address is equal to 0 and
the LSB is equal to 1, and ’shiftdown’ indicates the
positions where the MSB is equal to 1 and the LSB is
equal to 0. Nothing needs to be done where MSB = LSB.


2 - Perform the shifts: where ’shiftup’ is true, the elements
are obtained with a shift up by a distance of $2^{n-1} - 1$;
where ’shiftdown’ is true, the elements are obtained with
a shift down by a distance of $2^{n-1} - 1$.

The same algorithm applies for the sub and super
butterfly permutations, except that in the sub-butterfly a
$2^{k-1} - 1$ shift is performed, and in the super-butterfly a
$(2^{k-1} - 1)(2^{n-k})$ shift is performed.
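
The mask-and-shift structure can be sketched on a plain list. This is a Python illustration with invented function names; bit positions are 1-based with bit 1 as the LSB, and the masks are evaluated at the destination address, which is an interpretation made here.

```python
def swap_bits(x, i, j):
    """Exchange bits i and j of x (1-based positions, bit 1 = LSB)."""
    if ((x >> (i - 1)) & 1) != ((x >> (j - 1)) & 1):
        x ^= (1 << (i - 1)) | (1 << (j - 1))
    return x


def butterfly_addr(x, n):
    """Butterfly: exchange bit n (MSB) and bit 1 (LSB)."""
    return swap_bits(x, n, 1)


def sub_butterfly_addr(x, k):
    """Sub-butterfly: exchange bit k and bit 1."""
    return swap_bits(x, k, 1)


def super_butterfly_addr(x, n, k):
    """Super-butterfly: exchange bit n and bit n-k+1."""
    return swap_bits(x, n, n - k + 1)


def permute(v, addr_map):
    """Move the element at address x to address addr_map(x)."""
    out = [None] * len(v)
    for x, item in enumerate(v):
        out[addr_map(x)] = item
    return out


def butterfly_by_shifts(v, n):
    """Butterfly on 2^n elements using two masked shifts."""
    d = (1 << (n - 1)) - 1                # shift distance 2^(n-1) - 1
    out = list(v)                         # MSB = LSB: element stays
    for addr in range(1 << n):
        msb = (addr >> (n - 1)) & 1
        lsb = addr & 1
        if msb == 0 and lsb == 1:         # 'shiftup' mask
            out[addr] = v[addr + d]       # element arrives from below
        elif msb == 1 and lsb == 0:       # 'shiftdown' mask
            out[addr] = v[addr - d]       # element arrives from above
    return out


if __name__ == "__main__":
    v = list(range(8))
    print(butterfly_by_shifts(v, 3))
    print(permute(v, lambda x: butterfly_addr(x, 3)))
```

Because only two long shifts are needed, the cost is dominated by the shift distance rather than by the number of steps.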


The Bit Reversal Permutation

$\rho(x) = (b_1, b_2, \ldots, b_{n-1}, b_n)$


The bit reversal permutation consists of reversing the
order of the bits of the input address. Similarly, the sub-bit
reversal at bit k reverses the k least significant bits: bit k to
bit 1, and the super-bit reversal reverses the k most significant
bits: bit n to bit n-k+1. The binary representations of these
two permutations are:


$\rho_{(k)}(x) = (b_n, \ldots, b_{k+1}, b_1, b_2, \ldots, b_k)$


$\rho^{(k)}(x) = (b_{n-k+1}, b_{n-k+2}, \ldots, b_n, b_{n-k}, \ldots, b_1)$


The bit reversal permutation can be achieved with a series
of bit exchanges between the pairs of corresponding MSB and
LSB, i.e. between $b_n$ and $b_1$, $b_{n-1}$ and $b_2$, ..., $b_{n-m}$ and $b_{m+1}$,
where $m < n/2$. Therefore, a bit reversal can be realized by a
series of butterflies where the distance of each shift is
calculated from the bit position numbers. Exchanging MSB
bit j with LSB bit i requires a shift of length $2^{j-1} - 2^{i-1}$. The
algorithm consists of $\lfloor n/2 \rfloor$ iterations, where each iteration
consists of:


1 - Determine which pair of bits is to be exchanged, and
calculate the amount of the shift.

2 - Similarly to the butterfly, create the shift masks, and
then perform the shifts.


The same algorithm applies for the sub and super bit
reversal permutations, except that the bit exchanges are
performed only on the k least significant bits or the k most
significant bits, respectively. In both cases, the number of
iterations is equal to $\lfloor k/2 \rfloor$.
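
The decomposition into bit exchanges can be sketched on addresses. The Python below is illustrative only; the function names are invented here, bit 1 is the LSB, and the iteration count is taken to be the integer part of n/2 (or k/2), which is the number of bit pairs that actually differ in position.

```python
def swap_bits(x, i, j):
    """Exchange bits i and j of x (1-based positions, bit 1 = LSB)."""
    if ((x >> (i - 1)) & 1) != ((x >> (j - 1)) & 1):
        x ^= (1 << (i - 1)) | (1 << (j - 1))
    return x


def bit_reverse_addr(x, n):
    """Reverse all n address bits by successive MSB/LSB exchanges."""
    for m in range(n // 2):
        x = swap_bits(x, n - m, m + 1)
    return x


def sub_bit_reverse_addr(x, k):
    """Reverse only the k least significant bits."""
    for m in range(k // 2):
        x = swap_bits(x, k - m, m + 1)
    return x


def super_bit_reverse_addr(x, n, k):
    """Reverse only the k most significant of n bits."""
    for m in range(k // 2):
        x = swap_bits(x, n - m, n - k + 1 + m)
    return x


def shift_lengths(n):
    """Shift length 2^(j-1) - 2^(i-1) for each exchanged pair (j, i)."""
    return [(1 << (n - m - 1)) - (1 << m) for m in range(n // 2)]


if __name__ == "__main__":
    print(bit_reverse_addr(0b0000110, 7))
    print(shift_lengths(7))
```

For n = 7 the three iterations use progressively shorter shifts, so the first exchange accounts for most of the data movement.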


2.3. Performance Evaluation 


Each permutation was coded in Parallel Pascal and was 
run on the MPP using three types of data: 32-bit floating 
point, 8-bit integers, and Boolean. Program details and the
timing results are given in [6]. These programs accept a 
parameter k and generate the masks for the permutations 
each time that they are called. Some of the overhead incurred 
by the initialization code could be avoided. 


The results from the optimal time estimations are used 
first to characterize the performance of the MPP processor 
array hardware for data permutations, then these are 
combined with the measured results to characterize the 
efficiency of the Parallel Pascal compiler and the MPP control 
units. 


Due to the orthogonality of the test permutations, the 
performance of the MPP can be characterized by considering 
only one dimension of the processor array. On the MPP each 
permutation is performed concurrently on each row of the 
processor array; i.e., 128 sets of 128 elements (the performance
for permuting the columns is identical). Since all the
permutations considered are orthogonal with respect to the 
two dimensions of the MPP mesh connections, these results 
may be simply extended to the case of permuting a 16384 
element vector (or 128 x 128 matrix); the transfer ratio cost 
will be doubled and the compiler efficiency will remain the 
same. 


For each permutation, expressions for the optimal
execution times ($t_{o\pi}$) for a one dimensional permutation on the
MPP have been derived from the optimal execution times of
the primitive operations given in Table 1. These execution
times are optimal in the sense that they represent the cost of a
direct translation of the high level language program without
any overhead from the program control unit; i.e. the processor
array is doing useful work on every clock cycle. In some cases
a faster realization could be achieved by careful programming
at the microcode level. In Table 2 expressions are given for
sub-permutation costs $t_{o\pi_{(k)}}$ and super-permutation costs
$t_{o\pi^{(k)}}$, where k is an integer in the range of 1 to 7 which is
related to the group size. In the time expressions, $N_b$
represents the number of bits in the data elements: $N_b$ is equal
to 32 for floating point elements, to 8 for integer elements, and
to 1 for Boolean elements. The time unit used in all
expressions is µseconds.


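Table 2. Optimal Permutation Costs for one dimension of the MPP

Permutation | Optimal cost (µsec)
exchange, $t_{o\varepsilon_{(k)}}$ | $5.2 + N_b (0.6 + 0.1 \cdot 2^k)$
sub-shuffle, $t_{o\sigma_{(k)}}$ | $17.6 + 0.4 N_b + (0.5 N_b + 2.6)(2^k - 2)$
super-shuffle, $t_{o\sigma^{(k)}}$ | $17.6 + 0.4 N_b + [(12.8 \cdot 2^{-k} + 0.4) N_b + 2.6](2^k - 2)$
sub-butterfly, $t_{o\beta_{(k)}}$ | $0.4 N_b + 24.2 + 0.1 N_b \cdot 2^k$
super-butterfly, $t_{o\beta^{(k)}}$ | $13.4 N_b + 24.2 - 25.6 N_b \cdot 2^{-k}$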


sub-bit reversal | (113 + 0.6 * $N_b$) * k/2 + 0.2(2^k - 2 ...
super-bit reversal | (113 + 0.6 * $N_b$) * k/2 + 25.6(1 - 2 ...
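
The exchange, shuffle and butterfly expressions of Table 2 can be evaluated directly. The Python sketch below is an illustration with invented function names; the constants are copied from the table, where 12.8 and 25.6 correspond to 0.1 x 128 and 0.2 x 128 for a row of N = 128 elements. The bit reversal rows are left out.

```python
def exchange_cost(k, nb):
    """Exchange at bit k, elements of nb bits (microseconds)."""
    return 5.2 + nb * (0.6 + 0.1 * 2 ** k)


def sub_shuffle_cost(k, nb):
    return 17.6 + 0.4 * nb + (0.5 * nb + 2.6) * (2 ** k - 2)


def super_shuffle_cost(k, nb):
    per_step = (12.8 * 2 ** -k + 0.4) * nb + 2.6
    return 17.6 + 0.4 * nb + per_step * (2 ** k - 2)


def sub_butterfly_cost(k, nb):
    return 0.4 * nb + 24.2 + 0.1 * nb * 2 ** k


def super_butterfly_cost(k, nb):
    return 13.4 * nb + 24.2 - 25.6 * nb * 2 ** -k


if __name__ == "__main__":
    # Full (k = 7) permutations for Boolean, 8-bit and 32-bit data
    for nb in (1, 8, 32):
        print(nb,
              round(exchange_cost(7, nb), 1),
              round(sub_shuffle_cost(7, nb), 1),
              round(sub_butterfly_cost(7, nb), 1))
```

At k = 7 the sub and super forms of each permutation give the same cost, as expected when both reduce to the full permutation on 128 elements.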


The permutation transfer ratios for the different data
types are given in Figures 3-5. The horizontal axis spans the
range of sub and super permutations with the usual
permutation at k = 7 since, when N = 128,
$\pi(x) = \pi_{(7)}(x) = \pi^{(7)}(x)$. For the exchange permutation, the
kth sub and super permutations are considered to be identical.
Note that the transfer ratio is plotted on a logarithmic scale
and that the actual time in µsecs is also provided at the right
side of each graph.


The transfer ratios are in the range 1-20 for floating
point data, 6-200 for integer data and 100-2000 for Boolean
data. Recall that the transfer ratio is the number of
arithmetic operations of that data type which can be
performed in the time of one permutation. Ideally, if the
transfer ratios can be made less than 1 then permutations will
not impact the performance of the system. In many cases the
range for floating point numbers will not be a problem since a
number of operations are typically performed between
permutations in most algorithms. However, there is a
significant difference in performance between the shuffle and
bit reverse permutations compared to the others and these
permutations should be avoided if possible on the MPP.


Figure 3. Transfer Ratios for Floating Point data Permutations

Figure 4. Transfer Ratios for 8-bit Integer Permutations

Figure 5. Transfer Ratios for Boolean data Permutations


For the integer and Boolean data these permutations may 
incur a significant overhead unless they occur very infrequently 
in the algorithm. For small data sizes the additional 
operations necessary to implement the permutation dominate 
the permutation cost especially for the shuffle and bit reverse 
permutations. Since the performance of the permutations
differs by more than an order of magnitude, the careful
selection of the most appropriate permutation for a task is
even more important than for the floating point case.


The efficiency of the programming language and control 
units for a permutation is expressed by the ratio of the optimal 
time to the measured time. The percentage efficiency of the 
Parallel Pascal implemented permutations is shown for the 
different data types in Figures 6-8. The efficiency for the 
permutations ranges from 30-75 percent for floating point 
data, 20-50 percent for integer data and 10-28 percent for 
Boolean data. 


The three main potential causes of inefficiency are the
lack of code optimization, MCU overhead, and the loss of
useful cycles in the PCU microcode. The first causes the
processor array to do extra useless work such as
copying arrays to temporary buffers. The second occurs when
the MCU has too many operations to perform between MPP
macro operations, so that the FIFO buffer empties and the
PCU waits idle for the next operation. The last may occur
when the microcode architecture is unable to perform all
necessary operations in a single 100 ns cycle, so that the
array must miss a useful processing cycle.


It is difficult to ascertain the exact cause of inefficiency
from Figures 6-8; however, it is possible to determine the
benefit which could be achieved with an optimally coded
control system. The implementation of floating point data 
permutations is already reasonably efficient in most cases while 
the implementation of the Boolean data types is not very 
efficient. The results suggest that the main loss of efficiency is 
caused by the MCU overhead. The code generated by the 
Parallel Pascal Compiler is of a similar efficiency for each data 
type; furthermore, it is reasonable to assume that the PCU 
unit can efficiently manage single bitplane data. On the other 
hand, Boolean data places the largest load on the MCU since it 
must generate macro instructions for the PCU at a much 
higher rate than for other data types. 


3. Convolution 


Convolution is an important operation in many signal 
processing and image processing applications. A  two- 
dimensional convolution involves convolving a large data 
matrix D with a given small matrix W, the convolution 
kernel, as specified by 


$$R[i,j] = \sum_{x=-m}^{m} \sum_{y=-m}^{m} W[x,y] \cdot D[i+x,\, j+y]$$


where W is of size (2m + 1) × (2m + 1).


Conceptually, the convolution result for a given element
R[i,j] is obtained by superposing the W kernel onto the
matrix (with the center of W at position i,j), and multiplying
each kernel element with the corresponding matrix element.
The convolution result for i,j is then equal to the summation of
these products.


Figure 6. Efficiency of Floating Point data Permutations
Figure 7. Efficiency of 8-bit Integer Permutations
Figure 8. Efficiency of Boolean data Permutations


On the MPP, D is distributed on the processor array and
a convolution operation is implemented by a series of
shift-multiply-add operations (one for each element of the kernel).
The performance, on the MPP, of a two-dimensional
convolution operation involving a 5 × 5 kernel was examined.
The measured execution time for matrices with 8-bit integer
elements was 987 μsec, and for matrices with 32-bit floating
point elements was 5.28 msec. This is a processing rate of 830
MOPS for 8-bit integer data and 155 MFLOPS for 32-bit
floating point data. The time required for just the arithmetic
computations is 325 μsec (33% of the total time) for 8-bit
integer data and 3.92 msec (74% of the total time) for 32-bit
floating point data.
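

The shift-multiply-add scheme described above can be illustrated with a small sequential sketch. It is not the Parallel Pascal program that was measured; the function names are hypothetical, and cyclic wraparound at the matrix edges is an assumption made here only to keep the example short. Each kernel element costs one whole-array shift followed by an elementwise multiply and add, which is the step the processor array would perform in parallel.

```python
def rotate(m, dr, dc):
    """Cyclic shift: out[i][j] = m[(i + dr) % rows][(j + dc) % cols]."""
    rows, cols = len(m), len(m[0])
    return [[m[(i + dr) % rows][(j + dc) % cols] for j in range(cols)]
            for i in range(rows)]


def convolve_shift_multiply_add(d, w):
    """R[i][j] = sum over x, y in -m..m of W[x][y] * D[i + x][j + y] (wraparound assumed)."""
    half = len(w) // 2                      # kernel is (2m + 1) x (2m + 1); half == m
    rows, cols = len(d), len(d[0])
    result = [[0] * cols for _ in range(rows)]
    for x in range(-half, half + 1):
        for y in range(-half, half + 1):
            shifted = rotate(d, x, y)       # one whole-array shift per kernel element
            weight = w[x + half][y + half]  # kernel stored with index offset m
            for i in range(rows):           # these two loops are the parallel array step
                for j in range(cols):
                    result[i][j] += weight * shifted[i][j]
    return result
```

A kernel whose only nonzero entry is the center returns D unchanged, which is a quick sanity check of the indexing.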


4. The Fast Fourier Transform 


A Fast Fourier Transform (FFT) program that involves
the $\beta$, $\sigma$, and $\rho$ permutations was developed. The Fourier
Transform is the frequency domain representation of a
function and it is frequently used in several different scientific
applications. The FFT is a fast method to compute the
Discrete Fourier Transform (DFT), since it reduces the
calculation from $O(n^2)$ for the DFT to $O(n \log_2 n)$.
An N-point DFT is specified by:

$$X(n) = \sum_{k=0}^{N-1} x_0(k)\, w^{nk}, \qquad n = 0, 1, \ldots, N-1$$

where $w = e^{-j 2\pi / N}$ and N is a power of 2 [7].


4.1. The FFT Algorithm 


The input to the FFT program is an NXN array, and the 
FFT is performed concurrently on either the rows or the 
columns of the array, depending on the coordinate chosen. 
Consequently, a two dimensional FFT is achieved by 
performing an FFT on the columns and then on the rows, or 
vice versa. 


In general, the algorithm used to perform an FFT on a
given row (or column) of length N is defined as follows.

For a bit-reversed result:

$$FFT = \beta_{(n)}\, W\, \beta_{(n-1)}\, W \cdots \beta_{(2)}\, W\, \sigma\, W, \qquad \text{where } N = 2^n$$

For a naturally-ordered result:

$$FFT = \beta_{(n)}\, W\, \beta_{(n-1)}\, W \cdots \beta_{(2)}\, W\, \sigma\, W\, \rho, \qquad \text{where } N = 2^n$$

$\beta$, $\sigma$, and $\rho$ refer to the permutations presented in the previous
section; W represents the following operation:

$$x_l = x_{l-1} + w^p\, y_{l-1}$$
$$y_l = x_{l-1} - w^p\, y_{l-1}$$

where $w = e^{-j 2\pi / N}$, p represents the power of w, l represents the
iteration number, and x and y are dual nodes ([7], p. 154).
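

As an illustration of the dual-node operation W, the following sequential sketch applies x_l = x_(l-1) + w^p y_(l-1) and y_l = x_(l-1) - w^p y_(l-1) over log2 N iterations and produces a bit-reversed result. It follows the textbook dual-node formulation rather than the MPP program: the dual nodes are reached by indexing instead of by permutations, and the function names and the rule used to generate p are assumptions of this sketch.

```python
import cmath


def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def fft_bit_reversed(x):
    """Radix-2 FFT built from dual-node operations; output is in bit-reversed order."""
    n_points = len(x)
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("length must be a power of 2")
    bits = n_points.bit_length() - 1
    data = [complex(v) for v in x]
    for l in range(1, bits + 1):
        spacing = n_points >> l                        # distance between dual nodes
        nxt = data[:]
        for k in range(n_points):
            if k & spacing:
                continue                               # k is the lower node of a pair
            p = bit_reverse(k >> (bits - l), bits)     # power of w for this pair
            w_p = cmath.exp(-2j * cmath.pi * p / n_points)
            t = w_p * data[k + spacing]
            nxt[k] = data[k] + t                       # x_l = x_(l-1) + w^p * y_(l-1)
            nxt[k + spacing] = data[k] - t             # y_l = x_(l-1) - w^p * y_(l-1)
        data = nxt
    return data


def direct_dft(x):
    """O(N^2) reference: X(n) = sum over k of x(k) * w^(n k), with w = exp(-2j*pi/N)."""
    n_points = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / n_points)
                for k in range(n_points))
            for n in range(n_points)]
```

Element `bit_reverse(n, bits)` of the result holds X(n), so a final bit-reverse permutation yields the naturally-ordered transform.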


This is only one formulation of the FFT algorithm; for
example, an alternative well known FFT formulation is
obtained if every $\beta$ is replaced by a $\sigma$. The best permutation
sequence to use depends upon the relative speeds of the regular
permutations discussed in section 2. For the applications
programmer, the performance of the permutations on the total
system is required. The transfer ratios obtained from the time
measurements, normalized by the attainable floating point
arithmetic speed of 80 μs, are shown in Figure 9. From this
graph it can be seen that the butterfly permutation is an order
of magnitude faster than the shuffle permutation and
sub-butterflies are faster still. Therefore, it will always be better to
use butterfly permutations on the MPP rather than shuffles
whenever possible. Furthermore, a reasonable computational
performance may be anticipated since all
permutations have a transfer ratio of less than 10, which is the
order of the number of operations involved with each node pair
calculation.


As mentioned previously, the FFT algorithm has
complexity $O(n \log_2 n)$, but on the MPP it takes $\log_2 N$ steps
because there are N PE's for N elements, as opposed to one PE
for N elements (N = 128). On the MPP, a step or iteration l
consists of the following:


1. Perform the $\beta_{(n-l+1)}$ permutation, except when l = n, in
which case a $\sigma$ permutation is performed. At this point all
the dual node pairs are located in adjacent PE's.


2. Calculate the weights, the $w^p$'s. First the values of the p's are
generated; these are dependent on l and on the
corresponding array position. Then the real part of $w^p$ is
equal to $\cos(\frac{2\pi}{N} p)$ and the imaginary part is equal to
$-\sin(\frac{2\pi}{N} p)$.


3. Calculate the $x_l$'s and $y_l$'s, i.e. perform the W operation
presented above. Initially, the $x_{l-1}$'s are in the even
numbered PE's and a copy is sent to the corresponding
dual odd numbered PE; the opposite is done to the $y_{l-1}$'s.
Then the complex multiplication and addition are
performed.


Figure 9. Measured Transfer Ratios for Floating Point data Permutations


Table 3. FFT operation Times
41.9  64.6  87.3  110.0  133.0  4.2  0
Total FFT  14820


4.2. Performance Evaluation 
An FFT has been investigated on the MPP which is
defined by:

$$FFT = \beta_{(7)}\, W\, \beta_{(6)}\, W\, \beta_{(5)}\, W\, \beta_{(4)}\, W\, \beta_{(3)}\, W\, \beta_{(2)}\, W\, \sigma\, W$$


The times for the various operations of the FFT algorithm are
given in Table 3. The execution times of these permutation
functions are more than the execution times of the same
permutations presented in section 2, because, in the FFT, the
data elements used are complex numbers, i.e. two 32-bit
floating point elements. The execution time of the W
operation is independent of l, the iteration number. The pgen
function generates the p values and its execution time depends
on the value of l.


The measured time for the FFT is greater than the 
estimated optimal time by a factor of 1.5. The execution times 
of the FFT’s are dominated by arithmetic operations (W 
calculation) and by shifting operations (permutations); no 
reduction functions are used. The MCU overhead is likely to
be minimal since most of the operations are array operations
of 32-bit floating point elements; therefore, the differences 
between estimated and measured times are probably mainly 
due to the lack of code optimization and the inefficient 
implementation of Parallel Pascal primitive operations. 


As a performance measure of the FFT, the number of 
MFLOPS was calculated. An operation is defined as either a 
floating point multiplication or an addition. It was determined 
that the sine and cosine operations, used in the calculation of 
the weight factors, have a measured execution time equal to
334 μsec, and thus, a sine or cosine operation on the MPP has
a cost equivalent to four floating point multiplication 
operations. 


In an FFT, two main calculations are performed at each
iteration: first, the weight factors are generated, and second,
the W operation is performed. In general, at each iteration 9
floating point operations per processor are necessary for the
weight calculation and 8 are necessary for the W operation.
There are seven iterations on $128^2$ processors, and the FFT
takes 0.0224 seconds. Therefore, the FFT calculation achieves
an approximate rate of 87 MFLOPS.
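

The quoted rate can be checked from the figures in this paragraph, taking the array as 128 × 128 = 16,384 processors and counting 9 + 8 = 17 floating point operations per processor per iteration:

```latex
\[
\frac{7 \times (9 + 8) \times 128^2}{0.0224\ \text{s}}
  = \frac{1\,949\,696}{0.0224\ \text{s}}
  \approx 87.0 \times 10^{6}\ \text{floating point operations per second}
\]
```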


In addition to the MFLOPS calculation, the percentage of
time spent on shift operations and the percentage of time
spent on arithmetic calculations were found. The shift
operations consist of the butterfly and the shuffle
permutations, and the arithmetic operations consist of the
weight factor calculation (pgen and the sine and cosine
operations) and the W operation. The FFT program
spends approximately 46% of the execution time on shift
operations, and approximately 54% on arithmetic operations.


5. Arbitrary Data Mapping 


A parallel data mapping for a two-dimensional matrix
may be specified by two coordinate-index matrices. These are
called the r matrix (for row index) and the c matrix (for
column index). A mapping from an input matrix M to a result
matrix R is specified by

$$R[i,j] = M[\,r[i,j],\; c[i,j]\,].$$


Two algorithms have been explored to perform this mapping 
function: a simple direct algorithm and a heuristic algorithm. 


5.1. The Simple Algorithm 


The simple algorithm requires every element of the input
matrix to be passed by every position of the output matrix.
When an element is located at the appropriate output position,
that is, when it has moved the correct distance according to
the r and c matrices, its value is then stored at this position.
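
    for i := 1 to n do
    begin
        for j := 1 to n do
        begin
            { store the elements that have reached their output position }
            where (r = i) and (c = j) do
                R := M;
            M := rotate(M, 0, 1);    { rotate M one position in the second dimension }
        end;
        M := rotate(M, 1, 0);        { rotate M one position in the first dimension }
    end;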


The cost complexity of this algorithm is $O(n^2)$, where n is
equal to the number of rows (or columns) in the matrix. This
algorithm always requires $n^2$ iterations and a total of $n^2 + n$
data rotations.
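

The behavior of the simple algorithm can be sketched sequentially as follows. This is an illustration only: it assumes that r and c hold absolute zero-based source indices (so that R[i][j] = M[r[i][j]][c[i][j]]) and that rotations are cyclic, and the function names are hypothetical. The rotation counter confirms the n² + n total.

```python
def rotate(m, dr, dc):
    """Cyclic rotation: out[i][j] = m[(i + dr) % n][(j + dc) % n]."""
    n = len(m)
    return [[m[(i + dr) % n][(j + dc) % n] for j in range(n)] for i in range(n)]


def simple_mapping(m, r, c):
    """Pass every input element by every output position; store it where it belongs."""
    n = len(m)
    result = [[None] * n for _ in range(n)]
    cur = [row[:] for row in m]
    rotations = 0
    for di in range(n):
        for dj in range(n):
            # cur[i][j] now holds the original m[(i + di) % n][(j + dj) % n]
            for i in range(n):              # the masked ("where") array assignment
                for j in range(n):
                    if r[i][j] == (i + di) % n and c[i][j] == (j + dj) % n:
                        result[i][j] = cur[i][j]
            cur = rotate(cur, 0, 1)
            rotations += 1
        cur = rotate(cur, 1, 0)
        rotations += 1
    return result, rotations
```

For a 3 × 3 transpose mapping (r[i][j] = j, c[i][j] = i) the sketch performs 9 iterations and 12 rotations regardless of the mapping, which is the data-independent cost the text describes.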


5.2. The Heuristic Algorithm 


A heuristic algorithm [8] has been developed for the MPP 
which takes advantage of uniformity or locality existing in the 
movement of the data elements. Its performance depends upon 
the available locality in the specific mapping operation. The 
general concept is that the matrix M is only shifted to 
positions where data elements are to be mapped to R. In order 
to know how far M can be moved in one step it is necessary to 
find the shortest distance to the next needed displacement. 
This is achieved by min reduction functions applied to relative 
versions of the c and r matrices. Therefore, the number of 
iterations is reduced but each iteration is more complex since a 
calculation involving two or more reductions is needed to 
determine the displacement to the next iteration. 


The worst case cost of the heuristic algorithm is $O(n^2)$, as
this algorithm may conceivably require up to $n^2$ iterations;
however, this has never been observed in practice.
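

The idea of skipping displacements that no element needs can be sketched as below. This is not the algorithm of [8], whose details are not given here; it is a simplified illustration that assumes absolute zero-based index matrices, cyclic movement, and a single min reduction per iteration over a row-major encoding of the relative displacements. All names are hypothetical.

```python
def heuristic_mapping(m, r, c):
    """Visit only the displacements that at least one output position needs."""
    n = len(m)
    # Relative displacement needed at each output position, encoded as di * n + dj.
    need = [[((r[i][j] - i) % n) * n + ((c[i][j] - j) % n) for j in range(n)]
            for i in range(n)]
    result = [[None] * n for _ in range(n)]
    done = -1                                # largest displacement handled so far
    iterations = 0
    while True:
        pending = [d for row in need for d in row if d > done]
        if not pending:
            break
        d = min(pending)                     # min reduction: next needed displacement
        di, dj = divmod(d, n)
        for i in range(n):                   # masked store at this displacement
            for j in range(n):
                if need[i][j] == d:
                    result[i][j] = m[(i + di) % n][(j + dj) % n]
        done = d
        iterations += 1
    return result, iterations
```

A uniform rotation needs a single iteration, while a mapping in which every element moves a different distance degrades toward the n² iterations of the worst case.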


5.3. Performance Evaluation 


The performance results obtained from applying the
heuristic algorithm to a set of six different matrix rotation
mappings are shown in Table 4. The performance of the
simple algorithm (which is the same for all data mappings) is
shown in the last row of Table 4. The first two columns give
the number of iterations and bit-shift operations required for
each mapping. The next column gives the number of
reductions required by the algorithm. The last four columns
give the estimated and measured timing results for 8-bit and
32-bit data; all times in this table are in milliseconds.


The results show that, even with the more complex 
iterations, the heuristic algorithm is significantly faster than 
the simple algorithm for most of the given data mappings 
when the optimal times are considered. However, in practice, 
the simple algorithm is usually faster. The measured results
for the simple algorithm were about 4 times longer than the
optimal results for 8-bit data and 2 times longer for 32-bit
data, which is comparable to previous results. For the
heuristic algorithm the corresponding figures are about 12 for
8-bit data and 8 for 32-bit data.


Three main factors which contribute to the unusual
behavior of the heuristic algorithm are as follows. First, the
implementation of the primitive reduction functions is not very
efficient (see Table 1). The implemented functions are more
than 4 times slower than the optimal case and 44% of the
estimated execution time of the heuristic algorithm is
accounted for by these reductions. The simple algorithm does
not involve any reductions. Second, after a reduction function
is executed the result is used in a conditional branch statement
in the MCU. This causes the FIFO buffer to empty and the
MCU must do additional work before the next macro
instruction can be generated. Third, 20% of the optimal
estimated time is accounted for by Boolean array operations;
Boolean operations are the least efficiently implemented when
controlled from the MCU. On average Boolean operations are
about 20 times slower than the optimal case. The simple
algorithm does not involve any Boolean operations.


Table 4. Cost of the Heuristic Algorithm for Matrix Rotation


Conclusion 


The capability of the MPP to perform a set of regular 
permutations has been studied in detail. The results indicate 
that the optimal implementation times for floating point data 
transfers may be reasonable for many applications but the 
permutation of small length data may not be very efficient. 
The performance of the total system for permutations is also 
quite good for floating point data but significant savings might 
be made if the shorter data type permutations were 
reprogrammed in PCU microcode. The analysis techniques 
presented could be applied to other highly parallel 
architectures. 


Several characteristic algorithms have been considered: 
convolution which is implemented with a few primitive 
operations, the FFT which involves significant data 
permutations, and the heuristic algorithm which has a data 
dependent behavior. In general, the Parallel Pascal Compiler 
performed well for large data types and deterministic 
algorithms (which provided the lightest load for the MCU). It 
did not perform as well for complex algorithms involving 
Boolean data or reduction functions; however, in this case it 
was still quite adequate for algorithm prototyping. It is not 
clear that an optimizing compiler would be very much faster, 
for the difficult algorithms, unless it generated PCU microcode 
for critical sections. 


In terms of processing speed, using Batcher's figures [1]
the peak performance of the MPP is 288 MFLOPS; from our
primitive operation measurements the fastest rate we could
expect to attain is 210 MFLOPS (due to the slower add time).
For the convolution algorithm a rate of 155 MFLOPS was
attained and for a 128 x 128 FFT the rate was 87 MFLOPS.
These algorithms were conveniently programmed in Parallel
Pascal. Furthermore, 128 x 128 is the worst case size for the
FFT implementation on the MPP; for either larger or smaller
matrix sizes the comparative overhead due to interprocessor
communication would be less.
Abstract


Until now, most results reported for parallelism in production 
systems (rule-based systems) have been simulation results -- very 
few real parallel implementations exist. In this paper, we present 
results from our parallel implementation of OPS5 on the Encore 
multiprocessor. The implementation exploits very fine-grained 
parallelism to achieve significant speed-ups. For one of the 
applications, we achieve 12.4 fold speed-up using 13 processes. 
Our implementation is also distinct from other parallel 
implementations in that we parallelize a highly optimized C-based 
implementation of OPS5. Running on a uniprocessor, our C-based 
implementation is 10-20 times faster than the standard lisp 
implementation distributed by Carnegie Mellon University. In 
addition to presenting the performance numbers, the paper 
discusses the amount of contention observed for shared data 
structures, and the techniques used to reduce such contention. 


1. Introduction 

As the technology of production systems (rule-based systems) is 
maturing, larger and more complex expert systems are being built 
both in industry and in universities. Often these large and complex 
systems are very slow in their execution, and this limits their 
utility. Researchers have been exploring many alternative ways for
speeding up the execution of production systems. Some efforts
have been focussing on high-performance uniprocessor
implementations [2, 10], while others have been focussing on
high-performance parallel implementations [3, 6, 9, 11, 12, 13, 14].
This paper focusses on parallel implementations.


Until now, most results reported for parallelism in production
systems have been simulation results. In fact, very few real
parallel implementations exist. In this paper, we present results
from our parallel implementation of OPS5 on an Encore Multimax
shared-memory multiprocessor with sixteen CPUs. The
implementation, called PSM-E (Production System Machine
project's Encore implementation), exploits very fine-grained
parallelism to achieve up to 12.4 fold speed-up for match using 13
processes. Our implementation is distinct from other parallel
implementations in that we parallelize a highly optimized C-based
implementation of OPS5. This is in contrast to other efforts where
slow lisp-based implementations are being parallelized. Running
on a uniprocessor, our C-based implementation is 10-20 times
faster than the lisp implementation of OPS5 distributed by
Carnegie Mellon University. A consequence of parallelizing a
highly-optimized implementation is that one must be very careful
about overheads, else the overheads may nullify the speed-up. One
need not be as careful when parallelizing an unoptimized
implementation. In this paper, we first discuss the design of an
optimized implementation of OPS5, and then discuss the additions
that were made for the parallel implementation. For the parallel
implementation, we discuss the synchronization mechanisms that
were used, the contention observed for various shared data
structures, and the techniques used to reduce such contention.


The paper is organized as follows. Section 2 presents some 
background information about the OPS5 language, the Rete match 
algorithm, and the Encore Multimax multiprocessor. Section 3 
gives an overview of the parallel interpreter and then goes into the 
implementation details describing how the rules are compiled and 
how various synchronization and scheduling issues are handled. 
Section 4 presents the results of the implementation on the Encore 
multiprocessor. Finally, in Section 5 we summarize the results and 
conclude. 


2. Background 

This section is divided into three parts. The first subsection
describes the basics of the OPS5 production-system language -- the
language which we have implemented in parallel. The second
subsection describes the Rete algorithm -- the algorithm that forms
the basis for our parallel implementation. The third subsection
describes the Encore Multimax computer system -- the
multiprocessor on which we have done the parallel
implementation.

2.1. OPS5 


An OPS5 [1] production system is composed of a set of if-then 
rules called productions that make up the production memory, and 
a database of temporary assertions called the working memory. 
The assertions in the working memory are called working memory 
elements (wmes). Each production consists of a conjunction of 
condition elements corresponding to the if part of the rule (also 
called the left-hand side of the production), and a set of actions 
corresponding to the then part of the rule (also called the 
right-hand side of the production). The actions associated with a 
production can add, remove or modify working memory elements, 
or perform input-output. Figure 2-1 shows a production named 
find-colored-block with two condition elements in its left-hand side 
and one action in its right-hand side. 


(p find-colored-block
    (goal ^type find-block ^color <c>)
    (block ^id <i> ^color <c> ^selected no)
-->
    (modify 2 ^selected yes))


Figure 2-1: A sample production. 


The production system interpreter is the underlying mechanism
that determines the set of satisfied productions and controls the
execution of the production system program. The interpreter
executes a production system program by performing the following
recognize-act cycle:

• Match: In this first phase, the left-hand sides of all
productions are matched against the contents of
working memory. As a result a conflict set is
obtained, which consists of instantiations of all
satisfied productions. An instantiation of a production
is an ordered list of working memory elements that
satisfies the left-hand side of the production.


• Conflict-Resolution: In this second phase, one of the
production instantiations in the conflict set is chosen
for execution. If no productions are satisfied, the
interpreter halts.


• Act: In this third phase, the actions of the production
selected in the conflict-resolution phase are executed.
These actions may change the contents of working
memory. At the end of this phase, the first phase is
executed again.


1 This paper presents only a summary of the results. A more in-depth analysis is
presented in [5].


A working memory element is a parenthesized list consisting of
a constant symbol called the class of the element and zero or more
attribute-value pairs. The attributes are symbols that are preceded
by the operator ^. The values are symbolic or numeric constants.
For example, the following working memory element has class C1,
the value 12 for attribute attr1 and the value 15 for attribute attr2.


(C1 ^attr1 12 ^attr2 15)


The condition elements in the left-hand side of a production are 
parenthesized lists similar to the working memory elements. They 
may optionally be preceded by the symbol -. Such condition 
elements are called negated condition elements. Condition 
elements are interpreted as partial descriptions of working memory 
elements. When a condition element describes a working memory 
element, the working memory element is said to match the 
condition element. A production is said to be satisfied when: (1) 
For every non-negated condition element in the left-hand side of 
the production, there exists a working memory element that 
matches it; (2) For every negated condition element in the left-hand 
side of the production, there does not exist a working memory 
element that matches it. 


Like a working memory element, a condition element contains a 
class name and a sequence of attribute-value pairs. However, the 
condition element is less restricted than the working memory 
element; while the working memory element can contain only 
constant symbols and numbers, the condition element can contain 
variables, predicate symbols, and a variety of other operators as 
well as constants. Variables are identifiers that begin with the 
character "<" and end with ">" -- for example, <i> and <c> are 
variables. A working memory element matches a condition 
element if they belong to the same class and if the value of every 
attribute in the condition element matches the value of the 
corresponding attribute in the working memory element. The rules 
for determining whether a working memory element value matches 
a condition element value are: (1) If the condition element value is 
a constant, it matches only an identical constant. (2) If the 
condition element value is a variable, it will match any value. 
However, if a variable occurs more than once in a left-hand side, 
all occurrences of the variable must match identical values. (3) If 
the condition element value is preceded by a predicate symbol, the 
working memory element value must be related to the condition
element value in the indicated way. 
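

The three matching rules above can be illustrated with a small sketch. The data representation is an assumption made for this example and is not the compiled code described later: a working memory element is a (class, attribute dictionary) pair, a variable is a string such as "<c>", and a predicate test is a (symbol, operand) tuple. The function and table names are hypothetical.

```python
import operator

# Predicate symbols mapped to comparisons (representation assumed for this sketch).
PREDICATES = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
              "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def is_variable(value):
    """Variables are identifiers that begin with "<" and end with ">"."""
    return (isinstance(value, str) and len(value) > 2
            and value.startswith("<") and value.endswith(">"))


def match_condition(ce, wme, bindings):
    """Return the extended variable bindings if wme matches ce, else None."""
    ce_class, ce_tests = ce
    wme_class, wme_attrs = wme
    if ce_class != wme_class:
        return None
    new_bindings = dict(bindings)
    for attr, test in ce_tests.items():
        if attr not in wme_attrs:
            return None
        value = wme_attrs[attr]
        if isinstance(test, tuple):                    # rule 3: predicate test
            symbol, operand = test
            if is_variable(operand):
                if operand not in new_bindings:
                    return None
                operand = new_bindings[operand]
            if not PREDICATES[symbol](value, operand):
                return None
        elif is_variable(test):                        # rule 2: variable
            if test in new_bindings and new_bindings[test] != value:
                return None                            # inconsistent with earlier binding
            new_bindings[test] = value
        elif test != value:                            # rule 1: constant
            return None
    return new_bindings
```

Applied to the production of Figure 2-1, matching the goal condition element first binds <c>, and only blocks of that color can then match the second condition element.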


The right-hand side of a production consists of an unconditional 
sequence of actions which can cause input-output, and which are 
responsible for changes to the working memory. Three kinds of 
actions are provided to effect working memory changes. Make 
creates a new working memory element and adds it to working 
memory. Modify changes one or more values of an existing 
working memory element. Remove deletes an element from the 
working memory. 


2.2. The Rete Match Algorithm 

In this subsection, we describe the Rete algorithm used for 
performing the match-phase in the execution of production 
systems. The match-phase is critical because it takes 90% of the 
execution time and as a result it needs to be speeded up most. Rete 
is a highly efficient algorithm for match that is also suitable for 
parallel implementations. A discussion of alternative match 
algorithms can be found in [3]. The Rete algorithm gains its 
efficiency from two optimizations. First, it exploits the fact that
only a small fraction of working memory changes each cycle by
storing results of match from previous cycles and using them in
subsequent cycles. Second, it exploits the similarity between
condition elements of productions (both within the same 
production and between different productions) to reduce the 
number of tests that it has to perform to do match. It does so by 
performing common tests only once. 


The Rete algorithm uses a special kind of a data-flow network 
compiled from the left-hand sides of productions to perform match. 
The network is generated at compile time, before the production 
system is actually run. Figure 2-2 shows such a network for 
productions p1 and p2, which appear in the top part of the figure. 
In this figure, lines have been drawn between nodes to indicate the 
paths along which information flows. Information flows from the 
top-node down along these paths. The nodes with a single 
predecessor (near the top of the figure) are the ones that are 
concerned with individual condition elements. The nodes with two
predecessors are the ones that check for consistency of variable 
bindings between condition elements. The terminal nodes are at 
the bottom of the figure. Note that when two left-hand sides 
require identical nodes, the algorithm shares part of the network 
rather than building duplicate nodes. 


To avoid performing the same tests repeatedly, the Rete 
algorithm stores the result of the match with working memory as 
state within the nodes. This way, only changes made to the
working memory by the most recent production firing have to be 
processed every cycle. Thus, the input to the Rete network 
consists of the changes to the working memory. These changes 
filter through the network updating the state stored within the 
network. The output of the network consists of a specification of 
changes to the conflict set. 


The objects that are passed between nodes are called tokens,
which consist of a tag and an ordered list of working-memory
elements. The tag can be either a +, indicating that something has
been added to the working memory, or a -, indicating that
something has been removed from it. No special tag for working-
memory element modification is needed because a modify is
treated as a delete followed by an add. The list of working-
memory elements associated with a token corresponds to a
sequence of those elements that the system is trying to match or
has already matched against a subsequence of condition elements
in the left-hand side.


(p p1 (C1 ^attr1 <x> ^attr2 12)
      (C2 ^attr1 15 ^attr2 <x>)
    - (C3 ^attr1 <x>)
  -->
      (remove 2))

(p p2 (C2 ^attr1 15 ^attr2 <y>)
      (C4 ^attr1 <y>)
  -->
      (modify 1 ^attr1 12))

[Network diagram: root; constant-test nodes class=C1, class=C2, class=C3, class=C4, attr2=12, attr1=15; memory nodes; two-input nodes; terminal nodes for p1 and p2]

Figure 2-2: The Rete network.


The data-flow network produced by the Rete algorithm consists 
of four different types of nodes. These are: 


1. Constant-test nodes: These nodes are used to test if 
the attributes in the condition element which have a 
constant value are satisfied. These nodes always 
appear in the top part of the network. They have only 
one input, and as a result, they are sometimes called 
one-input nodes. 


2. Memory nodes: These nodes store the results of the 
match phase from previous cycles as state within 
them. The state stored in a memory node consists of 
a list of the tokens that match a part of the left-hand 
side of the associated production. For example, the 
right-most memory node in Figure 2-2 stores all 
tokens matching the second condition-element of 
production p2. 


3. Two-input nodes: These nodes test for joint 
satisfaction of condition elements in the left-hand 
side of a production. Both inputs of a two-input node 
come from memory nodes. When a token arrives on 
the left input of a two-input node, it is compared to 
each token stored in the memory node connected to 
the right input. All token pairs that have consistent 
variable bindings are sent to the successors of the 
two-input node. Similar action is taken when a token 
arrives on the right input of a two-input node. 


4. Terminal nodes: There is one such node associated 
with each production in the program, as can be seen 
at bottom of Figure 2-2. Whenever a token flows 
into a terminal node, the corresponding production is 
either inserted into or deleted from the conflict set. 
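

The interaction between memory nodes and a two-input node can be sketched as follows. This is an illustrative simplification, not the compiled network of the implementation: the two input memories are folded into the join node, negated condition elements are not handled, and the class and function names are hypothetical. A token is a (tag, tuple of working memory elements) pair, and each working memory element is a (class, attribute dictionary) pair.

```python
class TwoInputNode:
    """Join node with its left and right input memories folded in (a simplification)."""

    def __init__(self, consistent):
        self.consistent = consistent   # checks variable bindings across a token pair
        self.left_memory = []          # wme lists that arrived on the left input
        self.right_memory = []         # wme lists that arrived on the right input

    @staticmethod
    def _update(memory, tag, wmes):
        # A "+" token adds state; a "-" token removes the matching stored entry.
        if tag == "+":
            memory.append(wmes)
        else:
            memory.remove(wmes)

    def left_activate(self, tag, wmes):
        """Store or delete the token, then pair it with the opposite memory."""
        self._update(self.left_memory, tag, wmes)
        return [(tag, wmes + other) for other in self.right_memory
                if self.consistent(wmes, other)]

    def right_activate(self, tag, wmes):
        self._update(self.right_memory, tag, wmes)
        return [(tag, other + wmes) for other in self.left_memory
                if self.consistent(other, wmes)]


def p2_consistent(left, right):
    """Join test for production p2: <y> in C2 ^attr2 must equal <y> in C4 ^attr1."""
    return left[-1][1]["attr2"] == right[0][1]["attr1"]
```

The tokens returned by an activation are what would be sent on to the successors of the node; for p2 the successor is the terminal node, so a returned + or - token corresponds to an insertion into or deletion from the conflict set.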




The most commonly used interpreter for OPS5 is the Rete-based
Franz Lisp interpreter. In this interpreter a significant loss in the 
speed is due to the interpretation overhead of nodes. In the OPS5 
implementation we present in this paper, the interpretation 
overhead has been eliminated by compiling the network directly 
into machine code. While it is possible to escape to the interpreter 
for complex operations during match or for setting up the initial 
conditions for the match, the majority of the match is done without 
an intervening interpretation level. This has led to a speed-up of 
10-20 fold over the Franz Lisp interpreter (see Table 4-4). In 
addition to this speed-up, our parallel implementation gets further 
speed-up by evaluating different node activations in the Rete 
network in parallel. 


2.3. Encore Multimax 

In this subsection, we describe the Encore Multimax shared- 
memory multiprocessor -- the computer system on which parallel 
OPS5 runs. The Multimax consists of 2-20 CPUs, each of which is 
connected to the shared-memory through a high performance bus. 
The shared-memory is equally accessible to all of the processors, 
in that each processor sees the same latency for memory accesses. 


The processors used in our Encore Multimax are National 
Semiconductor NS32032 chips along with NS32081 floating point 
coprocessors, each processor capable of approximately 0.75 
million instructions per second. There are two processors 
packaged per board and they share 32 Kbytes of cache memory. 
The processor boards use a combination of write-through strategy 
and bus-watching logic to keep the caches on different processor 
boards consistent. The bus used on the Encore Multimax is called 
the Nanobus. It is a synchronous bus and it can transfer 8 bytes of 
new information every 80 nanoseconds, thus providing a data 
transfer bandwidth of 100 Mbytes/second. 


The version of Encore Multimax available to us at CMU has 16 
processors, 32 Mbytes of memory, and runs the MACH operating 
system developed at Carnegie Mellon University. The operating 
system provides a UNIX-like interface to the user, although the 
internals are different and several extensions have been made to 
support the underlying parallel hardware. It provides facilities to 
automatically distribute processes amongst the available processors 
and it provides facilities for multiple processes to share memory 
for communication and synchronization purposes. The results 
reported in this paper correspond to this configuration of the 
Encore Multimax. 


3. Organization and Details of the Parallel Implementation 

When studying parallelism in production systems (or in any 
other application for that matter), it is important to compute the 
speed-ups with respect to the performance of the most efficient 
uniprocessor implementations. It is indeed quite easy to obtain 
large speed-ups with respect to inefficient implementations of the 
application, but such results have little practical utility. In the case 
of OPS5, the most efficient uniprocessor implementations are 
currently based on the Rete algorithm and they compile the Rete 
network directly into machine code and use global register 
allocation. Such compilation into machine code gives 
approximately 10-20 fold speed-up over Rete-based Lisp 
implementations of OPS5 (see Table 4-4). For this reason, our 
parallel implementation of OPS5 on the Encore is also Rete-based 
and compiles the Rete network directly into (NS32032) machine 
code (see footnote 2). Another effect of parallelizing a highly efficient 
implementation versus an inefficient one is that the number of 
instructions executed in each parallel subtask (for the same task 
decomposition) is smaller in the highly efficient implementation. 
This is equivalent to exploiting parallelism at a finer granularity, 
and as a result, the issues of synchronization and scheduling are 
more critical. 


3.1. High-Level Structure of the Parallel Implementation 

The parallel OPS5 implementation on the Encore (PSM-E) 
consists of one control process and one or more match processes. 
The number of match processes is a user-specified parameter, but it 
is fixed for the duration of any particular run. The system is 
generally used in a mode where the computer contains at least as 
many free processors as there are processes in the matcher; this 
permits each process to be assigned to a distinct processor for the 
duration of the run (provided the operating system is reasonably 
clever about assigning processes to processors). 


The control process is responsible for performing conflict 
resolution, evaluating the right-hand side of rules, handling 
input/output, and all the other functions of the interpreter except 
for performing match. It is also responsible for starting up the 
match processes at the beginning of the run and killing them at the 
end of the run. The match processes do nothing except perform the 
match. The match processes pipeline their operation with the 
control process. Thus when RHS evaluation begins, the match 
processes are idle. However, as soon as the first working memory 
change is computed, information about that change is passed to the 
match processes and they start to work. The control process 
continues evaluating the RHS, and as more changes are computed, 
the information is passed immediately to the match processes for 
them to handle as soon as they are able. When the control process 
finishes evaluating the RHS, it becomes idle and waits for the 
match processes to finish. When the last match process finishes, 
the control process performs conflict resolution and then begins 
evaluating the next RHS, thus starting the cycle over again (see 
footnote 3). 


Footnote 2: Note that the argument in the beginning of this paragraph does not say that one 
has to use the same algorithm (as the most efficient uniprocessor one) for the 
parallel implementation. It just turns out, in our case, that the efficient 
uniprocessor algorithm is also very good for parallel implementation. 


To perform match, the match processes use the Rete algorithm 
described in Section 2.2. The match processes exploit the 
dataflow-like nature of the Rete algorithm to achieve speed-up 
from parallelism. In particular, a single copy of the Rete network 
is held in shared memory. The match processes cooperate to pass 
tokens through the network and update the state stored in the 
memory nodes as indicated by the tokens. The match is broken 
into fairly small units of work called tasks, where a task is an 
independently schedulable unit of work that may be executed in 
parallel with other tasks. In our parallel implementation: 


• Small groups of constant-test node activations 
constitute a task. Multiple constant-test nodes are 
processed as a group, because individual constant-test 
node activations take only 3 machine instructions to 
execute, and that is too fine a granularity. 


• The memory nodes in the Rete network are coalesced 
with the two-input nodes that are below them. Each 
activation of these coalesced two-input nodes 
constitutes a single task. The reasons for this 
coalescing are discussed in [4]. As an example, the 
task corresponding to the left activation of a two-input 
node involves: (i) the addition/deletion of the 
incoming token to the left memory node; (ii) 
comparison of this token with all tokens in the 
opposite memory node checking for consistent 
variable bindings; and (iii) scheduling of matching 
token pairs for execution as new tasks. Note that 
multiple activations of the same two-input node 
constitute different tasks and these can be processed in 
parallel. 


• Each individual terminal node activation constitutes a 
task. 


In our current implementation, each task is represented by a data 
object called a token. The token in the parallel implementation is 
essentially the same as that used in the sequential Rete matcher (as 
described in Section 2.2), except that it has two extra items of 
information: the address of the node to which the token is to be 
sent, and if that node is a two-input node, an indication of whether 
to send it to the left or right input. The list of tokens that are 
awaiting processing is held in a central data structure called a task 
queue. The individual match processes perform match by 
executing the following loop. 


Footnote 3: For simplicity, we are ignoring two kinds of optimizations that are possible. 
First, it is possible to overlap conflict-resolution with match. Second, if 
speculative parallelism is used (we are willing to be wrong in our prediction 
sometimes and know how to recover from the error), it is possible to make a guess 
about the production that will fire next and to evaluate its right-hand side before 
conflict-resolution is completely finished. We choose to ignore these two 
optimizations for the present, because conflict-resolution and RHS evaluation are 
not the bottlenecks in our current implementation. 


Figure 3-1: Use of shared-memory by various processes. [Diagram labels: conflict set, match processes, shared copy of the Rete network, token memories, shared memory.] 


1. Remove a token from the task queue. If the queue is 
empty, wait until something is added. 


2. Process the token. If new tokens are to be sent out, 
push them onto the task queue. 


3. Go to step 1. 
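
The three-step loop above can be sketched as follows. This is only a minimal, single-process illustration in Python, not the compiled NS32032 code of the actual system. The names `Token`, `match_loop`, `toy_process` and the three-node network are hypothetical, a FIFO queue is assumed for simplicity, and the "wait until something is added" part of step 1 is replaced by returning once the queue is empty.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    node: str       # destination node (the paper stores the node's address)
    side: str       # "left" or "right" for a two-input node, "" otherwise
    payload: tuple  # working-memory elements carried by the token


def match_loop(task_queue, process):
    """Run steps 1-3 until the queue drains; return the number of tokens handled."""
    handled = 0
    while task_queue:
        token = task_queue.popleft()      # step 1: remove a token
        for new_token in process(token):  # step 2: process the token ...
            task_queue.append(new_token)  # ... and push any new tokens
        handled += 1                      # step 3: go back to step 1
    return handled


# Hypothetical toy network: root -> one two-input node -> one terminal node.
SUCCESSORS = {"root": ["join-1"], "join-1": ["terminal-1"], "terminal-1": []}


def toy_process(token):
    """Stand-in for a node activation: forward the payload to each successor."""
    return [Token(n, "left", token.payload) for n in SUCCESSORS[token.node]]
```

For example, `match_loop(deque([Token("root", "", ("wme-1",))]), toy_process)` handles one token per node of the toy network.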


3.2. Implementation Details 

All communication between processes (both the match processes 
and the control process) takes place via shared memory. The 
virtual address spaces are set up so that the objects in shared 
memory have the same virtual address in every process. Hence 
processes can simply pass pointers around in essentially the same 
way routines within a single process can. For example, the tokens 
are created in shared memory, and the address of a given token is 
the same in every virtual address space in the system. Thus when a 
process places a token onto the central task queue, all it really has 
to do is to put the address of the token into the task queue. Figure 
3-1 shows how the shared-memory is used to communicate 
between the various processes. 


Synchronization within the program is handled explicitly by 
executing interlocked test-and-set instructions. The 
synchronization primitives provided by the operating system (for 
example, semaphores, barriers, signals, etc.) are not used because of 
the large overhead associated with them. When a process finds 
that it is locked out of a critical region it spins on the lock, waiting 
for a chance to enter the region. In order to minimize the amount 
of bus traffic generated by the spinning processes, a "test and 
test-and-set" synchronization mechanism is used. In this scheme, a 
process uses ordinary memory-read instructions to test the status of 
a lock until it finds that it is free; then the process uses a 
test-and-set interlocked instruction to re-read the lock and set it (if it is still 
free). Note that while the lock is busy, the process spins out of its 
cache and does not use the bus. This is more efficient than using 
only the "test-and-set" interlocked instruction for the lock, because 
in that case the process generates bus traffic to perform the writes 
while it is busy waiting. 

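The difference between the two spinning styles can be illustrated with a small deterministic model. This is a sketch under stated assumptions, not the interlocked NS32032 instructions themselves: `SimulatedLockWord`, `spin_tas`, `spin_ttas` and `run` are hypothetical names, the lock holder is simulated by releasing the lock after a fixed number of polls, and each interlocked operation is simply counted as one bus transaction while plain reads are assumed to hit the cache.

```python
class SimulatedLockWord:
    """Lock word that counts plain reads and interlocked operations."""

    def __init__(self):
        self.taken = False
        self.plain_reads = 0      # assumed to be served from the local cache
        self.interlocked_ops = 0  # assumed to generate bus traffic

    def read(self):
        self.plain_reads += 1
        return self.taken

    def test_and_set(self):
        """Return the old value and set the word, modeled as one atomic step."""
        self.interlocked_ops += 1
        old, self.taken = self.taken, True
        return old


def spin_tas(lock, release_after):
    """Spin using only test-and-set: every poll is an interlocked operation."""
    polls = 0
    while lock.test_and_set():
        polls += 1
        if polls == release_after:  # simulated release by the lock holder
            lock.taken = False


def spin_ttas(lock, release_after):
    """Test and test-and-set: poll with plain reads, then one interlocked try."""
    polls = 0
    while True:
        while lock.read():
            polls += 1
            if polls == release_after:  # simulated release by the lock holder
                lock.taken = False
        if not lock.test_and_set():     # re-check and grab the lock atomically
            return


def run(spin, release_after=5):
    """Start with the lock held; release_after must be at least 1."""
    lock = SimulatedLockWord()
    lock.taken = True
    spin(lock, release_after)
    return lock.interlocked_ops, lock.plain_reads
```

In this model, `run(spin_tas)` issues one interlocked operation per poll, whereas `run(spin_ttas)` issues a single interlocked operation after polling with ordinary reads.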

The control process communicates with the match processes 
primarily through the shared task queue. Whenever the evaluation 
of an RHS results in a change to working memory, a token is 
created and marked as being destined for the root node of the 
network. The control process pushes these tokens onto the task 
queue in exactly the same way as the match processes push the 
tokens they create. The tokens are picked up and processed by 
waiting match processes. When the evaluation of an RHS begins, 
the match processes are idle. The first token created by the control 
process causes the match processes to start up. After the first 
token, the control process proceeds in parallel with the match 
processes. 


Depending on the granularity of tasks (number of instructions 
executed per task) that are scheduled using the task queue and 
depending on the number of processors that are trying to access the 
task queue in parallel, it is quite possible that a single task queue 
would become a bottleneck. For this reason, Gupta [4] proposed a 
hardware task scheduler for scheduling the fine-grained tasks. So 
far we have not implemented the hardware scheduler, and in this 
paper we present results only for the case when one or more 
software task queues are used. 


After the control process finishes evaluating the RHS, it must 
wait for the match processes to finish before it can perform the 
next conflict resolution operation. A global counter, TaskCount, is 
used to determine when all the match processes have finished. 
This counter contains the sum of: 


• the number of tokens that are currently on the task 
queue, and 


• the number of tokens that are being processed by the 
match processes. 

This count is maintained quite simply. Every time a token is put 
onto the task queue, the counter is incremented. Every time a 
match process finishes working with a token, the counter is 
decremented. The match phase is finished when the counter goes 
to zero. 
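
The counting rule above can be sketched with Python threads standing in for the match processes. This is an illustrative sketch only: `TaskQueue`, `match_process`, `run_match_phase` and `split` are hypothetical names, a `threading.Lock` stands in for the spin lock of the actual system, and the example workload is invented. The one point the sketch tries to make precise is the ordering: a match process pushes the tokens it generates before it decrements the counter, so the counter cannot reach zero while work remains.

```python
import threading
import time
from collections import deque


class TaskQueue:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for the paper's spin lock
        self._tokens = deque()
        self.task_count = 0            # tokens queued + tokens being processed
        self.processed = 0             # bookkeeping for the example only

    def push(self, token):
        with self._lock:
            self._tokens.append(token)
            self.task_count += 1       # incremented on every push

    def try_pop(self):
        with self._lock:
            return self._tokens.popleft() if self._tokens else None

    def done_with_token(self):
        with self._lock:
            self.task_count -= 1       # decremented when a token is finished
            self.processed += 1

    def match_finished(self):
        with self._lock:
            return self.task_count == 0


def match_process(queue, process, stop):
    while not stop.is_set():
        token = queue.try_pop()
        if token is None:
            time.sleep(0)              # queue empty: yield and poll again
            continue
        for new_token in process(token):
            queue.push(new_token)      # push successors first ...
        queue.done_with_token()        # ... then decrement the counter


def run_match_phase(initial_tokens, process, n_match=3):
    """Control-process side: feed tokens, then wait for the counter to hit zero."""
    queue, stop = TaskQueue(), threading.Event()
    for token in initial_tokens:
        queue.push(token)
    workers = [threading.Thread(target=match_process, args=(queue, process, stop))
               for _ in range(n_match)]
    for w in workers:
        w.start()
    while not queue.match_finished():
        time.sleep(0)
    stop.set()
    for w in workers:
        w.join()
    return queue


def split(n):
    """Invented workload: a token with value n > 0 generates two tokens n - 1."""
    return [n - 1, n - 1] if n > 0 else []
```

With the invented workload, `run_match_phase([4], split)` processes a binary tree of 31 tokens and returns only after the counter has dropped back to zero.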


Shifting our focus back to the evaluation of individual two-input 
node activations, we note that instead of having separate memories 
for each two-input node, the matcher has two large hash tables 
which hold all the tokens for the entire network. One hash table 
holds tokens for left memories of two-input nodes, and the other 
for right memories of two-input nodes. An alternative scheme is to 
have separate hash tables for each two-input node, but such a 
scheme was considered to be wasteful of space. The hash function 
that is applied to the tokens takes into account: 


• the values in the token which will have equality tests 
applied at the two-input node, and 


• the unique identifier of the two-input node which 
stored the tokens. 

This permits the two-input nodes to locate any tokens that are 
likely to pass the equal-variable tests quickly. It also permits 
multiple activations of the same two-input node to be processed in 
parallel. 
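
One way to picture the two global hash tables is the following sketch. It is an assumption-laden illustration, not the hash function of the actual system: the table size `N_BUCKETS`, the dictionary representation of tokens, the use of Python's built-in `hash`, and the names `line_index`, `store` and `candidates` are all hypothetical.

```python
N_BUCKETS = 1024  # hypothetical table size

# One table for left memories and one for right memories, shared by all nodes.
left_table = [[] for _ in range(N_BUCKETS)]
right_table = [[] for _ in range(N_BUCKETS)]


def line_index(node_id, token, eq_fields):
    """Bucket index computed from the node identifier and the values that
    will have equality tests applied at that two-input node."""
    key = (node_id,) + tuple(token[f] for f in eq_fields)
    return hash(key) % N_BUCKETS


def store(table, node_id, token, eq_fields):
    table[line_index(node_id, token, eq_fields)].append((node_id, token))


def candidates(table, node_id, token, eq_fields):
    """Tokens of the same node found in the bucket that `token` hashes to.
    Unrelated keys can share a bucket, so the variable-binding tests must
    still be applied to every candidate."""
    bucket = table[line_index(node_id, token, eq_fields)]
    return [t for (n, t) in bucket if n == node_id]
```

A right activation of node 7 carrying `x = 3` then only needs to examine `candidates(left_table, 7, token, ("x",))` rather than every left token stored for the node.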


The processing performed by the individual node activations in 
the parallel implementation is similar to the processing done in the 
sequential matcher with two exceptions: 


• Code has been added to the two-input nodes to handle 
conjugate token pairs. 


• Sections of code that access shared resources are 
protected by spin locks to ensure that only one process 
at a time can be accessing each resource. 


A conjugate pair is a pair of tokens with opposite signs (an add 
token request and a delete token request), but which refer to the 
same working memory element or list of working memory 
elements. Conjugate pairs arise in the match operation for a 
variety of reasons, which are too complex to go into here (see [4]). 
They occur in both sequential and parallel implementations of 
Rete, but they present much greater problems in a parallel system. 
The reason for this is that in a parallel system it is not possible to 
ensure that the tokens will be processed in the order in which they 
are generated, and consequently in some cases a token with a - 
(delete) flag will arrive at a two-input node before the 
corresponding token with the + (add) flag. The parallel matcher 
code handles this by saving the - tokens that arrive early on an 
extra-deletes-list without otherwise processing the token. When 
the corresponding + token arrives, both tokens are discarded. 
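
The handling of an early delete can be sketched as follows. This is a minimal illustration of the rule just described, not the code added to the two-input nodes in the actual system; the class `NodeMemory`, its method `update`, and the convention of returning a boolean to say whether the token should go on to the opposite-memory comparison are all hypothetical.

```python
class NodeMemory:
    """One memory of a two-input node, plus its extra-deletes list."""

    def __init__(self):
        self.tokens = []         # tokens currently stored in the memory
        self.extra_deletes = []  # '-' tokens that arrived before their '+' partner

    def update(self, sign, wmes):
        """Apply a '+' or '-' token. Return True if the token should continue
        to the opposite-memory comparison, False if it is absorbed here."""
        if sign == "+":
            if wmes in self.extra_deletes:
                self.extra_deletes.remove(wmes)  # conjugate pair: discard both
                return False
            self.tokens.append(wmes)
            return True
        if wmes in self.tokens:
            self.tokens.remove(wmes)             # ordinary delete
            return True
        self.extra_deletes.append(wmes)          # early delete: save it, do nothing else
        return False
```

If the delete for `("w1",)` arrives first, it is parked on `extra_deletes`; the later add finds it there and both tokens vanish without touching the stored tokens.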


Many resources in a parallel system have to be protected with 
mutual-exclusion locks -- the task queues, the count of the number 
of active tokens, the conflict set, etc. Most of these are relatively 
straightforward to protect and a simple variation of standard spin 
locks is used. The exception is the locks used to control access to 
the token hash tables. There are several different operations that 
are performed on the token hash tables, for example, searching for 
matching tokens, adding and removing tokens, adding and 
removing conjugate tokens, and we would like many of these 
operations to proceed in parallel without having any undesirable 
effects. Because of the importance of the hash tables to the 
performance of the system, several locking schemes were 
implemented and tried. Two of these schemes are described here. 


The first scheme, the simple one, is easy to describe and it 
provides a departure point for describing the second, more complex 
one. We define a "line" as a pair of corresponding buckets 
(buckets with the same hash index) from the left and right hash 
tables along with their associated extra-deletes lists. In this 
scheme, each line in the hash table has a flag controlling its use 
(see footnote 4). 
The flag takes on two values: Free and Taken. When a process has 
to work with the hash table, it examines the flag for the line it 
needs. If the flag is Free, it sets the flag to Taken and proceeds to 
perform the necessary operations; when it finishes, it sets the flag 
back to Free. If a process finds the flag set to Taken, it waits until 
the flag is set to Free. Of course, the act of testing and setting the 
flag must be an atomic operation. This synchronization scheme 
works, but it is a potential bottleneck when several tokens arrive at 
a node at about the same time and all of them require access to the 
same hash table line. 


The second scheme is a complex variant of the 
multiple-reader-single-writer locking scheme. It permits several 
tokens to be processed in the same line at the same time, though 
even here, some serialization of the processing is necessary when 
destructive modifications to the lists of tokens are performed. This 
scheme requires two locks, a flag, and a counter for each line in the 
hash table. The flag takes on three values: Unused, Left, and Right, 
to indicate respectively that the line is not currently being 
processed, that it is being used to process tokens arriving from the 
left, or that it is being used to process tokens arriving from the 
right. The counter indicates how many processes are using that 
line in the hash table; it is needed only so that the last process to 
finish using the line can set the flag back to Unused. The first lock 
ensures that only one process at a time can access the flag and the 
counter. When a process first tries to use a line in the hash table, it 
gets this lock, and checks the flag. If the flag indicates that tokens 
from the other side are being processed, the process releases the 
lock and puts the token back onto the task queue. If the flag allows 
the process to continue, it sets the flag if necessary, increments the 
counter, and releases the lock. For the remaining time that the 
process uses this line in the hash table, it leaves the flag and the 
counter untouched; finally, when the process finishes using the line 
it decrements the counter and if appropriate sets the flag to Unused 
(again, all within a section of code protected by this lock). All this 
is to ensure that tokens from two different sides are not processed at 
the same time. The second lock is used to ensure that only one 
process at a time can be modifying the token lists. Recall that the 
first task in processing a two-input node is to update the list of 
tokens stored in the memory node. To do this, the process gets the 
modification lock, searches the conjugate or regular token list, and 
either adds the token to or deletes it from one of these lists. 
When it has finished, it releases the modification lock and proceeds 
with searching the tokens in the opposite hash-table bucket to find 
those that satisfy the variable binding tests. 
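
The protocol of the second scheme can be sketched as follows. This is a hedged illustration rather than the implementation measured in this paper: `LineLock`, `try_enter`, `leave` and `process_activation` are hypothetical names, `threading.Lock` stands in for the spin locks, and the two callbacks are placeholders for the memory update and the opposite-memory search.

```python
import threading


class LineLock:
    """Per-line state for the complex scheme: a flag, a counter, two locks."""

    def __init__(self):
        self.flag = "Unused"                  # "Unused", "Left" or "Right"
        self.count = 0                        # processes currently using the line
        self._guard = threading.Lock()        # first lock: protects flag and counter
        self.modification = threading.Lock()  # second lock: serializes list updates

    def try_enter(self, side):
        """Return False if the other side is active (caller requeues the token)."""
        with self._guard:
            if self.flag not in ("Unused", side):
                return False
            self.flag = side
            self.count += 1
            return True

    def leave(self):
        with self._guard:
            self.count -= 1
            if self.count == 0:               # last process out resets the flag
                self.flag = "Unused"


def process_activation(line, side, update_memory, search_opposite):
    """One two-input node activation on a line. Returns None when the token
    must be put back onto the task queue."""
    if not line.try_enter(side):
        return None
    try:
        with line.modification:               # only one writer of the token lists
            update_memory()
        return search_opposite()              # same-side searches may overlap
    finally:
        line.leave()
```

Several activations from the same side can be inside `search_opposite` at once, while an activation from the other side is turned away until the counter returns to zero.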


More complex locking schemes can be devised and, in fact, were 
implemented and tested. One other scheme that was tried 
permitted more than one process to be searching the token lists to 
find tokens to delete; in this scheme the only serialization of the 
tasks occurred when the actual destructive modification of the 
token list was performed. As in all implementations, the main 
tradeoff to keep in mind is that in an attempt to speed up the rare 
cases, one should not slow down the normal case. 


Footnote 4: Note that any given operation on the token hash tables requires access to only 
a single line of the hash tables. In other words, processing a single node activation 
never requires access to multiple hash table lines. 


Table 4-1: Uniprocessor versions on Microvax-II. [Table body not recovered. Surviving column headers: PROGRAM; List-based memories (sec).] 


3.3. RHS Evaluation and Conflict Resolution 

In our system, the rules’ RHSs are compiled into a form of 
threaded code which is interpreted at run time [8]. Interpreting the 
threaded code is slower than executing compiled code, but 
since RHS evaluation is not a bottleneck to the performance, 
threaded code, which is simpler to compile, was considered fast 
enough. Conflict resolution in the system is handled by code 
written in the C language. This code is executed by the control 
process. 


4. Results 
We present results for the parallel execution of three production- 
system programs in this paper. These are: 


• Weaver [7], a VLSI routing program by Rostom 
Joobbani with 637 rules. 


• Rubik, a program that solves the Rubik’s cube, by 
James Allen with 70 rules. 


• Tourney, a program that assigns match schedules for a 
tournament by Bill Barabash from DEC with 17 rules. 


We have chosen Weaver because it represents a fairly large 
program and it demonstrates that our parallel OPS5 can handle real 
systems. Rubik is a smaller program that demonstrates some of the 
strengths of our parallel implementation and the Tourney program 
demonstrates some of the weaknesses of our parallel 
implementation. 


4.1. Results for the Uniprocessor Implementations of OPS5 

Before we did a parallel implementation on the Encore, we 
initially did several uniprocessor C-based implementations of 
OPS5. In this subsection, we present results for two of these 
uniprocessor implementations, vs1 and vs2, for the Microvax-II 
workstation (see footnote 5). The performance results for the vs1 and vs2 
implementations are shown in Table 4-1. The base version is vs1, 
and it is characterized by the use of linear lists to store tokens in 
node memories, just as uniprocessor Lisp implementations do (see footnote 6). 


Footnote 5: The results are presented for Microvax-II and not for Encore, because the 
uniprocessor implementations were done on the Microvax and only one of these 
was later taken over to the Encore. 


Footnote 6: Note that memory nodes are not shared in either vs1 or vs2 versions of OPS5, 
unlike in the Franzlisp version of OPS5. This optimization was not used in vs1 or 
vs2 because it is not possible to share memory nodes in the parallel 
implementations of OPS5 (see [4]), and we did not want to spend the effort just 
for the uniprocessor implementations. 


[Table 4-1, continued. Surviving column headers: Total number of WM-changes processed; Total number of node activations. Surviving values, row not identified: 8350, 554051.] 


The second version, vs2, uses a global hash table to store all 
memory-node tokens, as discussed in the previous section. If there 
are equality tests at the two-input node, the hash-table based 
scheme (i) reduces the number of tokens that have to be examined 
in the opposite memory to locate those that have consistent 
variable bindings, and (ii) for deletes, it reduces the number of 
tokens that have to be examined in the same memory to locate the 
token to be deleted. The statistics for the reduction in tokens 
examined in the opposite memory for the three programs are given 
in Table 4-2. Note the statistics are computed only for those node 
activations where the opposite memory is not empty. The statistics 
for the reduction in tokens examined in the same memory for 
delete requests are given in Table 4-3. As can be seen from the 
two tables, the savings are substantial, especially for the Tourney 
program. The time-saving effect of hash-based memories can be 
seen from numbers in Table 4-1. 


The second-to-last column in Table 4-1 gives the total number of 
wme-changes processed during the run for which data are 
presented, and the last column gives the total number of node 
activations processed during the run (this is also equal to the 
number of tasks that are pushed/popped from the task queue in the 
parallel version). Dividing the time in column vs2 by the number 
of tasks, we get the average duration for which a task executes. 
This has implications for the amount of synchronization and 
scheduling overhead that may be tolerated in the parallel 
implementation. Doing this division we get that the average 
duration of a task for Weaver is 230 microseconds (or 
approximately 115 machine instructions, as the VAX executes 
about 500,000 instructions per second), for Rubik is 175 
microseconds, and for Tourney is 1300 microseconds. 
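
The conversion from task duration to instruction count can be reproduced with a one-line calculation. The function name `instructions_per_task` is hypothetical. Only the Weaver figure of approximately 115 instructions is stated in the text; the Rubik and Tourney figures follow from applying the same assumed rate of 500,000 instructions per second to the durations quoted above.

```python
def instructions_per_task(task_microseconds, instructions_per_second=500_000):
    """Average instructions per task = duration (s) x instruction rate.
    The default rate is the 500,000 instructions/second quoted for the VAX."""
    return task_microseconds * instructions_per_second / 1_000_000


weaver = instructions_per_task(230)    # 115.0, matching the figure in the text
rubik = instructions_per_task(175)     # 87.5, derived under the same assumption
tourney = instructions_per_task(1300)  # 650.0, derived under the same assumption
```

At these sizes a task lasts on the order of a hundred to several hundred instructions, which is why even a few dozen instructions of locking and queueing per task are a noticeable overhead.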


Finally, Table 4-4 gives the speed-up that our uniprocessor C- 
based implementation achieves over the widely available 
Franzlisp-based OPS5 implementation when running on the 
Microvax-II. As the table shows, we get a speed-up of about 10-20 
fold over the Franzlisp-based implementation. The problem in the 
past has been that, for lack of better uniprocessor 
performance numbers, researchers have ended up comparing the 
performance of their highly optimized parallel OPS5 
implementations with the slow Franzlisp-based implementation. 
We think that such apples-to-oranges comparisons can be 
misleading, and we hope that in the future the performance of 
parallel implementations will be compared to the performance of 
this optimized uniprocessor implementation. 


Table 4-2: Number of tokens examined in opposite memory. [Table body not recovered. Surviving column headers: Tokens in opp mem for left actvns; Tokens in opp mem for right actvns.] 

Table 4-3: Number of tokens examined in same memory for deletes. 

Table 4-4: Speed-up of C-based over Franzlisp-based implementation. 


4.2. Results for the Multiprocessor Implementation of OPS5 
While the uniprocessor C-based implementations of OPS5 were 
done on the Microvax-II, the parallel version was done on the 
Encore Multimax multiprocessor. In this section, we present 
speed-up numbers for our implementation on the Encore and the 
results of our experiments as we varied (i) the number of task 
queues that were used and (ii) the locking structures used for token 
hash-table buckets. 


Table 4-5 shows results for the case when a single task queue is 
used and when simple locks (described in Section 3.2) are used 
with the token hash-table buckets. The first column simply gives 
the name of the programs. The second column gives the time 
taken to do the match when only one process is used (time for 
conflict-resolution and RHS evaluation is not included). The 
timing numbers in the second column correspond to version vs2 
discussed earlier. The numbers here are larger than the 
corresponding numbers in Table 4-1 because the NS32032 
processor used in Encore is slower than the Microvax-II processor 
and because of the presence of extra synchronization and 
scheduling code in the parallel implementation. The numbers 
given in the remaining columns are speed-up figures with respect 
to the time given in the second column. The number of processes 
used in the parallel match is given in the second row from the top 
in the table. These numbers are expressed as "1+k", where the "1" 
indicates the control process and the "k" indicates the number of 
match processes. The third row from the top indicates the number 
of task queues used, which is one for all entries in this table. 


In Table 4-5, the speed-up for the case when the number of 
processes is "1+1" is in two cases greater than one. This is because 
the set of node activations is different when the RHS evaluation 
and match are proceeding in parallel (even though match is being 
done by only one process), as compared to the case when match 
does not start until RHS evaluation is completely finished. The 
speed-up with multiple match processes is also quite disappointing 
for all three programs and especially for Tourney. Possible reasons 
are: (i) contention for access to the single task queue, (ii) 
contention for access to the hash-table buckets, and (iii) low 
intrinsic parallelism in the programs. We now explore the effects 
of removing the first two bottlenecks by using multiple task queues 
and by using more complex hash-table locking schemes. 


Table 4-6 presents results for the case when multiple task queues 
are used, while retaining simple hash-table locks. The speed-up 
increases significantly for both Weaver and Rubik, indicating that 
the contention for pushing and popping task queues must have 
been a bottleneck. The speed-up for Weaver for 1+13 processes 
goes up from 3.9-fold to 8.2-fold and that for Rubik goes up from 
6.3-fold to 11.4-fold. The speed-up for Tourney remains about the 
same at 2.4-fold. To get more insight into these results, we 
instrumented the task queue to get data on contention. The results 
are shown in Table 4-7. The table shows the contention among the 
processes for the centralized task queue as the number of match 
processes is increased. We see from the table that as the number of 
processes is increased, there is indeed significant contention for the 
single task queue in the case of Weaver and Rubik. For Tourney, there 
does not seem to be significant contention for the task queue, and 
that is why the speed-up does not increase when multiple task 
queues are used. The contention numbers drop from 24.62, 26.89, 
and 8.93 for single task queue to 4.85, 6.12, and 4.75 for eight task 
queues for Weaver, Rubik, and Tourney respectively when 13 
match processes are used. 


Examining the speed-up for Rubik in Table 4-6, it is interesting 
to note that we get 3.9-fold speed-up for Rubik using only 3 match 
processes. In this case, the speed-up is larger than the number of 
match processes because when the Rete network is evaluated in 
parallel, it is quite possible that the total number of node 
activations evaluated and their complexity is less than that in the 
sequential implementation. Of course, the final result of the match 
is still the same as the sequential implementation. 


In Table 4-8 we present results for the case when multiple task 
queues are used and when complex multiple-reader-single-writer 
locks (described in Section 3.2) are used for controlling entry to the 
token hash tables. We expected the complex locks to benefit those 
programs that (i) generate cross-products, that is, there are multiple 
activations of the same two-input node from the same side that 
need concurrent processing, and (ii) have long lists of tokens in 
hash-table buckets, where the complex locks help by allowing 
multiple processes to read the opposite memory at the same time. 
However, programs for which the above two conditions are not 
true may slow down when complex locks are used, because of the 
extra overhead that they incur due to complex locks. Table 4-9 
presents some results about contention when simple locks are used 
versus contention when complex locks are used. We see that the 
contention for the hash-table buckets decreases for all three 
programs when complex locks are used, although the increase in 
speed-up is not much. However, Table 4-9 does give an indication 
as to why we are getting very poor speed-up for the Tourney 
program. The poor speed-up for the Tourney program is due to the 
large contention for the hash-table locks resulting from multiple 
node activations trying to access the same hash-table bucket. This, 
in turn, is the result of a few culprit productions in Tourney that 
have condition elements with no common variables. By modifying 
two such productions using domain specific knowledge, we could 
increase the speed-up achieved using 1+13 processes from 2.7-fold 
to 5.1-fold. 


Table 4-5: Speed-up for single task queue and simple hash-table locks. 

Table 4-7: Contention for the centralized task queue. Measured by the number 
of times a process spins on the lock before it gets access to the task queue. 

Table 4-9: Contention for token hash-table locks. Measured by the number of 
times a process spins on a lock before it gets access to the hash-table bucket. 

[Table bodies not recovered. Surviving fragments, table not identified: a column header "6 processes" and the values 11.58 and 20.05.] 


5. Conclusions 

In this paper we have presented the details of a parallel 
implementation of OPS5 running on the Encore Multimax. The 
first observation is that it is important to speed up an optimized 
sequential implementation, otherwise most of the benefits are lost. 
For example, speeding up the Franzlisp implementation by 10-20 
fold from parallelism just brings us to the uniprocessor speed of the 
C-based implementation. Furthermore, the issues in parallelizing 
an optimized implementation are different from those in an 
unoptimized implementation, because only very limited overheads 
can be tolerated in an optimized implementation. 


The second observation we make is that it is possible to obtain 
significant speed-ups for OPS5 using fine-grained parallelism on a 
shared-memory multiprocessor. We get up to 12.4 fold speed-up 
for Rubik using 13 match processes. However, this does not work 
for all programs. The Tourney program, because of the presence 
of cross-products [4], resisted all our attempts to obtain higher 
speed-up. The average length of the individual tasks in our parallel 
implementation varies between 100-700 machine instructions for 
the three programs that we studied. In trying to exploit this fine- 
grained parallelism, we found that the scheduling of tasks on 
processors was a major bottleneck. We found it essential to use 
multiple task queues (instead of a single task queue) to obtain 
reasonable speed-up. For the Rubik program, going from one task 
queue to multiple task queues increased the speed-up from 6.3-fold 
to 11.4-fold. 
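As a rough illustration of why multiple task queues relieve the scheduling bottleneck, the sketch below guards each queue with its own lock, so workers that begin their scan at different queues seldom compete for the same lock. The class name `MultiQueueScheduler`, the `hint` and `start` parameters, and the round-robin scan are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
import threading
from collections import deque


class MultiQueueScheduler:
    """Several task queues, each guarded by its own lock (illustrative only)."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.locks = [threading.Lock() for _ in range(num_queues)]

    def push(self, task, hint):
        # 'hint' selects the queue; a worker would typically pass its own index.
        q = hint % len(self.queues)
        with self.locks[q]:
            self.queues[q].append(task)

    def pop(self, start):
        # Scan the queues round-robin beginning at 'start', so that different
        # workers usually touch different locks first.
        n = len(self.queues)
        for offset in range(n):
            q = (start + offset) % n
            with self.locks[q]:
                if self.queues[q]:
                    return self.queues[q].popleft()
        return None  # every queue was empty


sched = MultiQueueScheduler(4)
for i in range(8):
    sched.push(i, hint=i)
```

With a single queue every push and pop would serialize on one lock; here a worker only falls back to a neighbouring queue when its own queue is empty.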


The other variation that we explored to reduce the contention for 
shared data structures was in the complexity of locks used for 
hash-based memory nodes. We used both simple spin-locks and 
complex multiple-reader-single-writer locks. We observed that 


special note must be taken of rare-case versus normal-case 
execution. Trying to handle rare cases efficiently can slow down 
the normal case, and can result in overall poorer performance. For 
example, the provision of complex hash-table locks reduced the 
contention for the hash-table buckets, but it slowed down the 
overall execution speed of the programs. 


6. Acknowledgments 

This research was sponsored by the Defense Advanced Research 
Projects Agency (DOD), ARPA Order No. 4864, monitored by the 
Air Force Avionics Laboratory under Contract N00039-85-C-0134
and by the Encore Computer Corporation. Anoop Gupta is also 


supported by DARPA contract MDA903-83-C-0335 and an award


from the Digital Equipment Corporation. 


[Table 4-9 body not recoverable from the source. Columns: PROGRAM; contention with simple locks (6 and 12 processes); contention with MRSW locks (6 and 12 processes).]


References 


1. Lee Brownston, Robert Farrell, Elaine Kant, and Nancy Martin. 
Programming Expert Systems in OPS5: An Introduction to Rule- 
Based Programming. Addison-Wesley, 1985. 


2. Charles L. Forgy. The OPS83 Report. Tech. Rept. CMU- 
CS-84-133, Carnegie-Mellon University, Pittsburgh, May, 1984. 


3. Anoop Gupta, Charles Forgy, Allen Newell, and Robert Wedig. 
Parallel Algorithms and Architectures for Production Systems. 
13th International Symposium on Computer Architecture, June, 
1986. 


4. Anoop Gupta. Parallelism in Production Systems. Ph.D. Th.,
Carnegie-Mellon University, March 1986. Also available from
Morgan Kaufmann Publishers Inc.


5. Anoop Gupta, Charles Forgy, Dirk Kalp, Allen Newell, and 
Milind Tambe. Parallel Implementation of OPS5 on the Encore 
Multiprocessor: Results and Analysis. To appear in International 
Journal of Parallel Programming. 


6. Bruce K. Hillyer and David E. Shaw. "Execution of OPS5 
Production Systems on a Massively Parallel Machine". Journal of 
Parallel and Distributed Computing 3 (1986), 236-268. 


7. Rostam Joobbani and Daniel P. Siewiorek. Weaver: A 
Knowledge-Based Routing Expert. Design Automation 
Conference, 1985. 


8. Peter M. Kogge. "An Architectural Trail to Threaded-Code 
Systems". Computer March (1982). 


9. Edward J. Krall and Patrick F. McGehearty. "A Case Study of 
Parallel Execution of a Rule-Based Expert System". International 
Journal of Parallel Programming 15, 1 (1986), 5-32. 


10. Theodore F. Lehr. The Implementation of a Production 
System Machine. Hawaii International Conference on System 
Sciences, January, 1986. 


11. Daniel P. Miranker. TREAT: A New and Efficient Algorithm
for AI Production Systems. Ph.D. Th., Columbia University, 1987. 


12. Kemal Oflazer. Parallel Execution of Production Systems. 
International Conference on Parallel Processing, IEEE, August, 
1984. 


13. Raja Ramnarayan, Gerhard Zimmerman, and Stanley 
Krolikoski. PESA-1: A Parallel Architecture for OPS5 Production 
Systems. Hawaii International Conference on System Sciences, 
January, 1986. 


14. M.F.M. Tenorio and D.I. Moldovan. Mapping Production 
Systems into Multiprocessors. International Conference on 
Parallel Processing, IEEE, 1985. 


A Taxonomy of Synchronous Parallel Machines 
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Abstract: A new classificational scheme is presented which 
is consistent with Flynn’s taxonomy but is more expres- 
sive. The crucial idea is to recognize that a reference 
stream is composed of both values and addresses; their 
treatment exposes critical features of an architecture. This 
insight, together with the accompanying formal mecha- 
nism built on top of it, enables a large variety of recently 


developed (since Flynn’s work) machines to be distinguished, 


including VLIW, multigauge, systolic arrays, and the Con- 
nection Machines. Though the resulting taxonomic struc- 
ture is illuminating, the most important result of the clas- 
sification is the discovery that synchronous execution is 
NOT a defining property of computer architectures, but 
is a derived property, a consequence of other architectural 
features. The evidence for this result and the consequences 
for machine classification are presented. 


1 Introduction 


In 1966 Flynn [1] introduced his classification of computers.
This taxonomy proved to be very useful, giving us
terminology like SIMD and MIMD that endures to this 
day. The taxonomy, however, has long been described as 
too coarse, unable to distinguish between computers that 
seem to computer architects to be quite different. Though 
other classifications have been offered [2-5], the fact that 
Flynn’s classification has lasted for so long without being 
replaced or enhanced is a testament to the difficulty of
discovering something better. 

In this paper a new taxonomy is presented for synchronous
parallel computers. It has no pretensions of being
complete nor of capturing all features of synchronous
parallel computers. The taxonomy does clarify important 
distinctions among recently developed parallel computers, 
such as the VLIW machines, multigauge machines and cer- 
tain SIMD machines such as the Connection Machines. 

The key idea of the taxonomy is to quantify the compo- 
nents of the fetch/execute cycle that process I-streams and 
D-streams. To make fine distinctions among machines, one 
must separate these reference streams into their address 
and value components, because addressing and value pro- 
cessing are crucial features by which machines differ. 

Using this kind of analysis, a taxonomy is constructed. 
Many of the machines that are placed into different classes 
here would have been classified by Flynn’s scheme as SIMD 
so this approach permits finer distinctions to be made. 
Only a small number of classes have been described, and 
only one or two machines per class have been identified. 
Thus, there remain substantial opportunities for further 
research. 

1 This research funded in part by the Office of Naval Research Con- 


tract N00014-86-K-0264, National Science Foundation Grant CCR- 
8416878 and Air Force Office of Scientific Research Contract 88-0023. 
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Perhaps the most important result derived from the 
taxonomy concerns the property of “synchroneity”. The 
author and apparently many other researchers have treated 
synchroneity as a primary classificational property; we have 
spoken of “the synchronous vs. the asynchronous” ma- 
chines as if this should be an important way to distinguish 
between machines. It is not. The criterion used for clas- 


sifying machines in this taxonomy tells when a machine 
must have all of its instructions start at the same time, 
and when it is not necessary. This determination is based 
on how the machine addresses and processes instructions 
and data. Machines which must begin all instructions 
at the same time will automatically be synchronous; for 
those machines that need not begin their instructions at 
the same time it is an “engineering decision” whether to 
make them synchronous or asynchronous. Thus the quality
of being synchronous is a derived property: A machine
must have it because of other features, or it is a noncritical 
implementation feature. 


2 Preliminaries 


A reference stream, S, of a computer is a finite set of infinite 
sequences of pairs, 


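\[ S = \{\, (a_1 \langle t_1 \rangle)(a_2 \langle t_2 \rangle)\ldots,\; (b_1 \langle u_1 \rangle)(b_2 \langle u_2 \rangle)\ldots,\; \ldots,\; (c_1 \langle v_1 \rangle)(c_2 \langle v_2 \rangle)\ldots \,\} \]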


the first component of each pair being a nonnegative in- 
teger, called an address, and the second component being 
an n-tuple of nonnegative integers, called values, such that 
n is the same for all tuples of all sequences. An element 
of a reference stream is called a reference sequence. An 
I-stream is a reference stream whose values are interpreted 
as instructions, a D-stream is a reference stream whose 
values are interpreted as data. 

The interpretation of these definitions is simple. The 
elements of reference sequences are address, value pairs, 
the values simply being the contents fetched from (or stored 
to) memory at the address. A sequence of elements can 
be thought of as the history of the addresses and values 
moving between a processor and its memory space. An 
I-stream is made up of a finite set of these sequences, the 
number depending on how many instruction sequences the 
machine can process at one time; and a D-stream is made 
of a finite set of data sequences, the number depending on 
how many distinct operations the machine can perform at 
one time. 

Although the I-streams and D-streams have been defined
in an intuitive manner, their form is not convenient


for analysis. Accordingly, the following reassociation must 
be performed. Let 


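\[ S = \{\, (a_1 \langle t_1 \rangle)(a_2 \langle t_2 \rangle)\ldots,\; (b_1 \langle u_1 \rangle)(b_2 \langle u_2 \rangle)\ldots,\; \ldots,\; (c_1 \langle v_1 \rangle)(c_2 \langle v_2 \rangle)\ldots \,\} \]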


be a reference stream. Define two sequences: The address
sequence of S, denoted $S_a$, is a sequence whose i-th element
is a tuple formed from the addresses from the i-th elements
of each sequence of S,

\[ S_a = \langle a_1 b_1 \ldots c_1 \rangle \langle a_2 b_2 \ldots c_2 \rangle \ldots \]

and the value sequence of S, denoted $S_v$, is the sequence
whose i-th element is a tuple formed by concatenating the
value tuples from the i-th elements of each reference sequence
of S,

\[ S_v = \langle t_1 u_1 \ldots v_1 \rangle \langle t_2 u_2 \ldots v_2 \rangle \ldots \]


Notice that although a reference stream is a set of se- 
quences, address and value sequences are just sequences 
of tuples. 

It is possible to interpret these definitions as grouping
the corresponding addresses and corresponding value tuples
of a reference stream S into $S_a$ and $S_v$, respectively.

Let $S_x$ be a sequence of n-tuples; the width of the sequence
is $w(S_x) = n$.

Proposition 1: Let S be a reference stream with n-tuple
values, then

\[ w(S_a) = |S| \quad\text{and}\quad w(S_v) = n\,|S| \]

where | X | denotes the cardinality of the set X.
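The reassociation and Proposition 1 can be checked on a small example. The sketch below is only an illustration: it models a reference stream as a list of finite prefixes of its (infinite) reference sequences, and the function names and sample addresses and values are hypothetical.

```python
def address_sequence(stream, steps):
    """i-th element: tuple of the i-th addresses of every reference sequence."""
    return [tuple(seq[i][0] for seq in stream) for i in range(steps)]


def value_sequence(stream, steps):
    """i-th element: concatenation of the i-th value tuples of every sequence."""
    return [tuple(v for seq in stream for v in seq[i][1]) for i in range(steps)]


def width(seq_of_tuples):
    """Width of a sequence of n-tuples is n (all elements have the same length)."""
    return len(seq_of_tuples[0])


# Three reference sequences (|S| = 3), each element an (address, 2-tuple) pair.
stream = [
    [(10, (1, 2)), (11, (3, 4))],
    [(20, (5, 6)), (21, (7, 8))],
    [(30, (9, 0)), (31, (1, 1))],
]

s_a = address_sequence(stream, 2)  # [(10, 20, 30), (11, 21, 31)]
s_v = value_sequence(stream, 2)    # [(1, 2, 5, 6, 9, 0), (3, 4, 7, 8, 1, 1)]
```

Here the address sequence has width 3, the number of reference sequences, and the value sequence has width 6, which is n = 2 times that number.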

A computation is a pair (I,D), where I is an I-stream
and D is a D-stream. Computers are classified by the computations
they execute. A computer executes the computation
(I,D) provided it presents $w(I_a)$ instruction addresses
to memory to be fetched simultaneously, it decodes and
interprets $w(I_v)$ instructions simultaneously, it presents
$w(D_a)$ operand addresses to memory simultaneously, and
it performs $w(D_v)$ operations on distinct data values simultaneously.
The computer is described by the notation

\[ I_{w(I_a)\,w(I_v)}\, D_{w(D_a)\,w(D_v)}. \]


Notice that we speak of the computation executed by a 
computer. This is a definitional simplification, and is suf- 
ficient since any desired sequence of instructions or data 
is a subsequence of the infinite streams of the computa- 
tion. Observe the relationship between this point and the 
Enumeration Theorem of recursive function theory. 

Let $d_1$, $d_2$, $d_3$ and $d_4$ be predicates called class designators;
then a machine is said to be a member of the class
denoted by

\[ I_{d_1 d_2}\, D_{d_3 d_4} \]

if and only if $d_1(w(I_a))$, $d_2(w(I_v))$, $d_3(w(D_a))$, and $d_4(w(D_v))$.


(Commas may occasionally be inserted between the sub- 
scripts for clarity.) 




Example 2: By appropriating for our class designators
Flynn’s “s” and “m” to denote the predicates “is-one” and
“is-many”, it is possible to classify some familiar machines
using the mechanisms developed so far.

Let a von Neumann machine, which Flynn classified as
SISD, execute the computation (I,D). From his classification
we have

\[ |I| = |D| = 1. \]

By Proposition 1, then, we have

\[ w(I_a) = w(D_a) = 1. \]

Moreover, since instructions are decoded serially, $w(I_v) = 1$,
and since they are executed serially, $w(D_v) = 1$. Therefore
the von Neumann machine is described as

\[ I_{1,1} D_{1,1}. \]

It is classified with the present notation as

\[ I_{ss} D_{ss} \]

since the predicate “s” is true for all four widths.


Now consider two machines that Flynn’s taxonomy lumped 


in the SIMD category, the MPP and the Illiac IV. (Ignore 
for the moment the fact that these have bit serial and word 
parallel PEs, respectively.) The single instruction stream 
means $|I| = 1$ for both machines. By the same reasoning
just used for the von Neumann machine, the instruction
streams for both machines are described as $I_{ss}$.

For data, consider the MPP first. Recall that the MPP 
controller broadcasts the same data memory address to all 
PEs [6], and so the machine has a single D-stream in our 
terminology; | D | = 1. However, a value is fetched from 
each PE memory, so the values of this stream are 16384- 
tuples. Thus, 


\[ w(D_a) = 1 \quad\text{and}\quad w(D_v) = 16384, \]

which certainly satisfies the “multiple” class designator.
So, the MPP is described as

\[ I_{1,1} D_{1,16384} \]

and is classified as

\[ I_{ss} D_{sm}. \]

The MPP has a “multiple data stream” but the multiplicity
applies only to the data values, not to the data value
addressing.

For the Illiac IV on the other hand, the controller 
broadcasts a base address to all PEs, each of which may 
produce its own address by adding in the contents of a 
local index register [7]. This means that | D | = 64; there 
are 64 operand address streams simultaneously produced 
by the machine and each of them references a single value, 
i.e. each data address is associated with a 1-tuple. Accordingly,

\[ w(D_a) = w(D_v) = 64 \]

and the Illiac IV is described as

\[ I_{1,1} D_{64,64} \]

which places it in the

\[ I_{ss} D_{mm} \]


class. It has “multiple data streams” too, but its mul- 
tiplicities are for addressing and data reference. Clearly, 
the present taxonomy retains the distinctions achieved by 
Flynn, but it is also capable of making finer distinctions. 
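The classification in this example can be replayed mechanically. The sketch below is an illustration only: the `Machine` record, the function names, and the `flynn` helper (which looks only at the two value widths) are assumptions introduced here; the widths themselves are the ones given in the text.

```python
from typing import NamedTuple


class Machine(NamedTuple):
    """The four widths of the computation (I, D) a machine executes."""
    ia: int  # w(I_a): instruction addresses presented at once
    iv: int  # w(I_v): instructions decoded at once
    da: int  # w(D_a): operand addresses presented at once
    dv: int  # w(D_v): data values operated on at once


def designator(width: int) -> str:
    """Example 2 predicates: 's' means "is-one", 'm' means "is-many"."""
    return "s" if width == 1 else "m"


def classify(m: Machine) -> str:
    """Class name in the form I d1 d2 D d3 d4."""
    return "I{}{}D{}{}".format(*(designator(w) for w in m))


def flynn(m: Machine) -> str:
    """Coarser, Flynn-style label based only on the value widths (assumption)."""
    return ("S" if m.iv == 1 else "M") + "I" + ("S" if m.dv == 1 else "M") + "D"


VON_NEUMANN = Machine(1, 1, 1, 1)
MPP = Machine(1, 1, 1, 16384)
ILLIAC_IV = Machine(1, 1, 64, 64)
```

Under these assumptions the MPP and the Illiac IV receive the same Flynn-style label, while `classify` separates them by the width of their operand addressing.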


3 Discussion 


It is possible to give an intuitive interpretation to much 
of the foregoing formalism. The key idea is to recognize 
that the formalism quantifies functional components of a
fetch/execute cycle. Thus, the machine described as

\[ I_{a v} D_{a' v'} \]

presents instruction addresses to memory for $a$ threads of
control (presumably from $a$ PCs, but data flow computers
qualify as well); it receives $v$ different instructions back
from memory at once and interprets them; it presents $a'$
different operand addresses to memory for data values, and
it receives $v'$ data values back and operates upon them
concurrently. So, when the MPP is described as

\[ I_{1,1} D_{1,16384} \]


it is immediately obvious that its PEs all use the same 
address for accessing their operand values, even though 
they are capable of independently performing operations 
on the resulting data. 

The interpretation of the classification is intended to 
carry the implication that if the n-tuple of values $\langle t_i \rangle$
is received from memory upon presentation of address $a_i$,
then the machine is capable of processing all n elements
at once. This applies to both instructions and data. So
even if a computer makes a memory reference to address
$a_i$ and fetches k words, perhaps to cache them, if it only
processes one of them, then n = 1 in this model.

Finally, notice that our classificational scheme is a com- 
pletely formal system with a precise meaning. Its utility in 
classifying computers depends entirely on our interpreting 
this formalism as meaningful. Though it is possible for two 
scientists to differ in their interpretation, and thus to differ 
on a classification, the underlying scheme is unambiguous.


4 Properties of Address and Value Sequences


To simplify discussing computer families, it is convenient 
to adopt a simple abbreviation. The expression 


\[ I_{d_1 d_2} D_{d_3 d_4} \]

will be abbreviated by the string

\[ d_1 d_2 d_3 d_4. \]


Thus, the von Neumann machine class is abbreviated ssss, 
while the MPP is in sssm. String expressions will be used 
as shorthand to abbreviate several classes. 

There are several important properties of this taxo- 
nomic system which influence the kind of machine classes 
definable. 

Proposition 3: Any machine $I_{av} D_{a'v'}$ satisfies the inequalities:

\[ a \le v \quad\text{and}\quad a' \le v'. \]

These inequalities follow from the fact that in a reference
stream every address is paired with at least one value, so
the width of the address stream is a lower limit to the width
of the corresponding value stream. The interpretation of
these inequalities seems intuitively correct: The number
of addresses presented to memory should never exceed the 
number of values returned. As a corollary, any nonempty 
machine class will satisfy these relationships, where the 
definition of the relation is suitably extended. 


Convention 4: Any machine $I_{av} D_{a'v'}$ will satisfy the
inequality:

\[ v \le v'. \]

Unlike the preceding propositions, which are artifacts of
the taxonomy’s abstraction, this convention is adopted primarily
for semantic consistency. Its interpretation is that
the number of instructions being interpreted should not 
exceed the data available. 

Since it is a convention, it is open to debate. On the 
positive side the convention helps avoid “problem” ma- 
chines like Flynn’s MISD; this machine doesn’t make much 
sense and has often been criticized. Here, the convention is 
worthwhile, considering that the finer control of this tax- 
onomy permits greater opportunities to create such du- 
bious classes. On the negative side, adopting the conven- 
tion might prevent accurately describing certain machines, 
though none has come to the author’s attention. Since a 
taxonomy is descriptive (as opposed to being prescriptive) 
and given that architects are not likely to have their cre- 
ativity constrained by this convention, we adopt it. 


5 More Machine Classes 


The efficacy of a classificational system usually depends to 
some degree on interpretation. (It always does in biological 
taxonomy.) Usually there is a large range (sometimes a 
continuum) of values that a property can assume, and we 
wish to assign certain segments of this range to different 
classes. But there may not be any effective way to identify 
the boundaries of these ranges, and so membership is often 


a matter of judgement. This characteristic will persist for 


this taxonomy, but confusion can be minimized by being 
somewhat more precise about the terminology that we’ve 
already used. 

Define the class designators as follows: 


• s is the predicate “equals 1”,

• c is the predicate “from 1 to some (small) constant”, and

• m is the predicate “from 1 to an arbitrarily large
finite number”.


Though the c and m designators have no upper limit in
principle, they are intended to convey two different mean- 
ings. When the c designation is used the range has a hard 
upper limit usually due to internal constraints in the ar- 
chitecture and cannot be easily increased by a substantial 
amount. An example might be the number of instructions 
that can be packed in the instruction word of a VLIW 
machine[8]; for any given word size it is fixed, and even 
though the word size can be increased this is probably not 
the intended nor the rational way to generalize the given
machine. The m designation, however, is used when the
quantity can be easily generalized or scaled. An example
is when additional PEs can be added, as with the MPP.

These distinctions are not always clear, of course, and 
judgement must be applied. An example is the question 
of how to classify a machine with processors connected 
to a bus [9]. In principle, there is no limit to how many 
processors can be attached to a bus, but with the addition 
of each processor the congestion increases, and this is an 
internal constraint reducing the performance. Is this a “c” 
or “m” case? Arguments can be made on both sides; we 
leave the question open for the moment. 

It is now possible, using the class designators, Propo- 
sition 3 and the convention to define a number of machine 
classes. Notice that there is no attempt to be complete in 
either defining classes or categorizing machines: 


$I_{ss}D_{ss}$  von Neumann machines.

$I_{ss}D_{sc}$  “packed” von Neumann [10]; the machine can
        fetch several distinct data values from fixed
        positions from one address and simultaneously
        apply the same operation to them. Many machines
        have some instructions of this form, e.g.
        performing 2 half word adds on the word at a
        given address; all (ALU) instructions for a machine
        in this class would have this capability.

$I_{ss}D_{sm}$  SIMD Parallel Machines with no addressability,
        such as the MPP, the Connection Machine
        1 [11] and systolic arrays [12].

$I_{ss}D_{cc}$  SIMD Multigauge machines [13]; these are
        von Neumann machines which can (optionally)
        split their datapath to process multiple,
        independent operand streams at once.

$I_{ss}D_{mm}$  Addressable SIMD Parallel Machines, such as
        Illiac IV and the CM2 [14].

$I_{sc}D_{cc}$  VLIW Machines [8]; the machine fetches and
        executes several instructions stored in one
        instruction address.

$I_{cc}D_{cc}$  MIMD Multigauge machines [13]; these are
        von Neumann machines which can (optionally)
        split their fetch/execute cycle to process
        multiple, independent instructions concurrently.

$I_{mm}D_{mm}$  MIMD Parallel Machines, including machines
        such as the Ultracomputer [15] and the Cosmic
        Cube [16].
Clearly, the list is not complete in terms of either the 
classes listed or the machines recognized as members of 
any given class. Much work remains. 
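To see how far the list is from complete, one can enumerate the class strings permitted by Proposition 3 and Convention 4. The sketch below assumes that the inequalities are extended to designators through the ordering s ≤ c ≤ m; the text only says the relation is "suitably extended", so this ordering, like the names `ORDER`, `is_legal`, `LEGAL`, and `LISTED`, is an assumption. `LISTED` holds the abbreviated strings of the classes described above.

```python
from itertools import product

# Assumed extension of the width inequalities to class designators.
ORDER = {"s": 0, "c": 1, "m": 2}


def is_legal(cls: str) -> bool:
    """cls is an abbreviated class string d1 d2 d3 d4, e.g. 'sssm'."""
    a, v, a2, v2 = (ORDER[d] for d in cls)
    # Proposition 3: a <= v and a' <= v'.  Convention 4: v <= v'.
    return a <= v and a2 <= v2 and v <= v2


# Every four-letter string over {s, c, m} that passes both checks.
LEGAL = ["".join(c) for c in product("scm", repeat=4) if is_legal("".join(c))]

# Abbreviated strings of the classes described in the list above.
LISTED = ["ssss", "sssc", "sssm", "sscc", "ssmm", "sccc", "cccc", "mmmm"]
```

Under this assumed ordering 25 of the 81 possible strings are legal, all eight listed classes are among them, and a MISD-like string such as `mmss` is rejected by the convention.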


6 Discussion of the Taxonomy and the Origins of Synchronous Computation


One is struck by at least two aspects of the foregoing clas- 
sification: A large and diverse set of machines are lumped 
into the last classification, mmmm, and nowhere in the tax- 
onomy has the synchronous requirement been mentioned, 
except in the paper’s title. These two observations are 
related.

In effect the taxonomy uses as its “criterion for classification”
the number of repeated instances of the principal
functional activities of the fetch/execute cycle. So, machines
are distinguished by how many instructions they
can decode at once or how many operations on separate 
data they can perform simultaneously. But these are not 
the characteristics we think of as distinguishing the differ- 
ent MIMD parallel computers. Rather, we think of them 
as being different depending on whether or not they have 
global shared memory or what their interconnection topol- 
ogy is. These are features unrelated to the fetch/execute 
cycle. So, lumping MIMD parallel computers in the mmmm 
class says only that by the criterion applied, they are all 
equivalent. 

This is unsurprising and is not evidence of weakness 
in the classification. Indeed, it might point to why efforts 
to find criteria suitable for classifying all parallel machines 
have so far been unsuccessful: Qualities that are important 
for some computers are, for other computers, unimportant,
irrelevant or even misleading as a guide to classification. 


To the extent that the taxonomy provides insight by its 
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classifications, the “criterion” amounts to being a useful 
way of looking at some computers. (For cases where topology
matters for synchronous machines, e.g. between the
MPP and the CM1, see the next section.) 

Interestingly, the “criterion” apparently mandates the
synchronous property. By using “the number of repeated
instances of the principal functional activities of the
fetch/execute cycle” as the basis for classification, we are
thinking of machines as either a single f/e cycle that has certain
components replicated, or multiple copies of a f/e cycle.
In the former case (all legal machines of the form sy,
$y \in \{s,c,m\}^3$) synchronous execution is mandated because
there is only one cycle running. With the current structure
of the taxonomy this leaves only the cy classes and mmmm.
Though there is no requirement in the model that these be
synchronous, the class designators provide some basis for
deciding: The c designation carries with it the implication
of “other constraints” limiting the extent to which the f/e
cycle can be replicated; it is reasonable to presume that
these constraints might well mandate synchronous execution.
(For example, they do mandate synchronous execution
for the MIMD Multigauge members of the cccc class.)
However, the m designation carries no such constraints and
the possibility of arbitrary scaling would seem to imply a
weaker coupling consistent with asynchronous execution.
Thus, except for the mmmm machines, these computers
are synchronous because of the structure of their reference
streams, and not the other way around. Synchronous execution
is a derived property. Indeed, the mmmm class is
the only one where there is a possibility for making an “engineering”
choice between synchronous and asynchronous.
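The argument of this section can be summarized as a small decision procedure over abbreviated class strings. This is only a sketch of the reasoning above: the function name `synchrony` and the returned labels are hypothetical, and strings that the discussion does not cover are reported as such rather than guessed at.

```python
def synchrony(cls: str) -> str:
    """Derive the synchrony status of a class string such as 'sssm'."""
    if cls[0] == "s":
        # A single fetch/execute cycle with replicated components.
        return "mandated"
    if cls == "mmmm":
        # Arbitrary scaling; synchronous or asynchronous by design decision.
        return "engineering choice"
    if cls[0] == "c":
        # "Other constraints" limit replication and may force synchrony.
        return "presumed"
    # Other strings beginning with 'm' are not covered by the discussion.
    return "not discussed"
```

For instance, `synchrony("sssm")` reports that synchronous execution is mandated for the MPP class, whereas `synchrony("mmmm")` leaves the choice open.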


7 Applications of the Taxonomy 


So far the foundations of a classificational scheme have 
been laid, a taxonomy of synchronous machines defined, 
and a number of machines assigned to classes. This is 
about all that can be presented here. The taxonomy’s 
structure is more extensive and the analysis is more com- 
plete. For example, consider the following: 

The classification has thus far clustered machines to- 
gether in broad groups, but in doing so it has glossed over 
significant differences among them. For example, the Illiac IV
and the CM2 (Connection Machine 2 [14]) are both
in the class ssmm, yet they have the following differences,
among others, some or all of which are probably significant:


                                          Illiac IV    CM2
Number of PEs                             64           65,536
Datapath width (bits)                     64           1
Frequency with which a PE generates
  operand addresses (instructions)        1            32
Topology                                  “torus”      trunc. bin. cube


Blurring such distinctions is a necessary role for taxonomies:
The higher the classificational level the greater
the distinctions being ignored. When detail matters, however,
these distinctions must be expressed. The classificational
scheme has been extended to capture such things
as the size of the data items, the bandwidth to memory, 
topology, etc. These added features not only permit a de- 
scription of the differences between the Illiac IV and the 
CM2, but they have suggested apparently new architec- 
tures [17]. 
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ABSTRACT - In constructing modules for parallel execution, 
an important thing to be considered is what execution scheme 
will be taken for parallelism. In this paper, we develop some 
parallel execution schemes in the Petri net model of a task and 
present a module construction algorithm for each scheme. The 
maximum firing rule is used for large amount of parallelism. 
This is useful in the early stage of software development for a 
multiprocessor. 


1. INTRODUCTION


In a parallel processing system, a task is assumed to con- 
sist of modules interconnected with each other and a module 
consists of actions. Petri nets and related graph models pro- 
vide an important formalism for modeling and analyzing asyn- 
chronous and concurrent activities of a task. These models 
have been widely used for representing and analyzing those 
tasks in various applications [1]-[3]. However, little work has 
been done for constructing modules for parallel execution, 
which can be run on a multiprocessor. There is an important 
thing to be solved, i.e., what execution scheme will be taken 
for the parallelism. An execution scheme was proposed in [4] 
for this problem. But, that scheme requires a lot of synchroni- 
zation, so the timing overhead for task control becomes large 
when it is implemented in a multiprocessor. 


In this paper, we develop some Petri net based parallel
execution schemes and present an algorithm for constructing
modules by using each of the schemes. They are focused on
increasing asynchronous activity and thus reducing the control
overhead. The maximum firing rule is used to obtain a large amount
of parallelism. Further, we consider how each execution scheme can
be realized as software that is effective with respect to modularity,
modifiability, and so on. This is based on
the hierarchical decomposition technique proposed in [5].
Thus, if a task can be represented in a tree-structured form
with a certain hierarchy, task assignment in a multiprocessor
with an arbitrary number of processors is no longer an NP
problem [6]. We omit the definitions of Petri nets and related
details including languages, and follow the ones in [7].


2. PRELIMINARIES

In a Petri net, a place p is an input (or an output) place 
of a transition t if and only if there exists at least an arc (p, t) 
(or (t, p)). The bag of all input (or output) places of a tran- 
sition is called the precondition (or postcondition) and denoted 
as I(t) (or O(t)). Thus, a marked Petri net N is defined as a 
5-tuple N = (P, T, I, O, $M_0$). We consider the places with
non-zero markings for the representation of a state. Thus, the 
initial marking of the net in Fig. 1, is represented by a state 
{3p,}. We regard the firing of a transition as an action and a 
module consists of actions. 


Now, we give a formal definition of maximal parallelism. 
It is justified by the maximal-firings of which each element is 
a maximal set of transitions firable in a state. 


Definition-1 (Maximal-firings $M_f$)
Let N = (P, T, I, O, $M_0$) be a marked Petri net, S be a
state represented by a bag of places and $p(y)$ be the power set
of a set y. All the enabled transitions in the state S, $M_e(S)$, is
the set defined as
\[ M_e(S) = \{\, t_i \mid t_i \in T,\; I(t_i) \subseteq S \,\} \]
The maximal-firings in S, $M_f(S)$, is the set of all maximal X’s
over $M_e(S)$ such that
\[ X \in p(M_e(S)) \quad\text{and}\quad \Bigl(\sum_{t \in X} I(t)\Bigr) \subseteq S. \]


Remarks for justification: Obviously, $M_e(S)$ is the
set of all the transitions enabled in the state S, but they may not be
simultaneously firable, because some transitions may share their
input places. Therefore, every set of simultaneously firable
transitions in S must reside in the power set $p(M_e(S))$. Since
each element X of $M_f(S)$ is defined as one of the elements
in the power set, its domain is $M_e(S)$. Moreover, since an element
X must satisfy the condition $(\sum_{t \in X} I(t)) \subseteq S$, all transitions
in X are simultaneously firable in S. Finally, we select
only the maximal ones among such X’s; therefore $M_f(S)$ is the set
of all maximal sets of simultaneously firable transitions. Note
that transitions which share input places need not be in conflict.
If the marking has sufficient tokens in the shared input places
to enable each contending transition individually, then those
transitions can fire simultaneously.
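A brute-force reading of Definition-1 can be sketched with bags represented as `collections.Counter` objects. The three-transition net below is hypothetical (it is not the net of Fig. 1), and the function names are introduced here for illustration; the exhaustive search over subsets is exponential and is meant only to mirror the definition.

```python
from collections import Counter
from itertools import combinations


def bag_leq(a, b):
    """True if bag a is contained in bag b (b must be a Counter)."""
    return all(b[p] >= n for p, n in a.items())


def enabled(inputs, state):
    """All transitions whose precondition I(t) is contained in the state."""
    return {t for t, pre in inputs.items() if bag_leq(pre, state)}


def maximal_firings(inputs, state):
    """Maximal sets X of enabled transitions whose summed preconditions fit in the state."""
    en = sorted(enabled(inputs, state))
    firable = []
    for r in range(len(en) + 1):
        for combo in combinations(en, r):
            need = Counter()
            for t in combo:
                need.update(inputs[t])  # bag sum of the preconditions
            if bag_leq(need, state):
                firable.append(frozenset(combo))
    # Keep only the sets that are not proper subsets of another firable set.
    return {x for x in firable if not any(x < y for y in firable)}


# Hypothetical net: t3 shares its input places with t1 and t2.
INPUTS = {
    "t1": Counter({"p1": 1}),
    "t2": Counter({"p2": 1}),
    "t3": Counter({"p1": 1, "p2": 1}),
}
```

With one token in each of `p1` and `p2`, the three transitions are all enabled but `t3` conflicts with the other two; with two tokens in each place the shared places hold enough tokens and all three can fire together.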


All the transitions in each element of $M_f$ fire simultaneously;
thus we call such a firing scheme the maximum firing rule.


Some concurrency metrics are extracted from $M_f(S)$ and
each state of a Petri net in execution is characterized by using
them. Note that $M_f(S)$ has the general form

\[ M_f(S) = \{X_1, X_2, \ldots, X_N\}, \quad\text{where } 1 \le N \le |p(M_e(S))|, \]

and each $X_k \in M_f(S)$ has the form

\[ X_k = \{t_1, t_2, \ldots, t_n\}, \quad\text{where } 0 \le n \le |T|. \]


Definition-2 (Concurrency Metrics)
Let M_f(S) = {X_k | 1 ≤ k ≤ N, k integer}.


(a) degree of concurrency : C(S) = Σ_{k=1}^{N} |X_k|
(b) degree of decision : D(S) = N
(c) concurrency to decision ratio : CDR(S) = C(S)/D(S).


For example, in the Petri net shown in Fig. 1, suppose S_1 is
given by {p_1, p_2, p_3, p_4}; then M_f(S_1) = {{t_1, t_2}, {t_3, t_4}}. In
this case, C(S_1) = 4 and D(S_1) = 2 so that CDR(S_1) is 2, which
means that the average number of transitions which can fire
simultaneously in S_1 is two.


According to C(S) and D(S), each state of a Petri net in
execution under the maximum firing rule can be classified into
one of five classes, class-0, class-1, class-2, class-3 and class-4,
as shown in Table-1 [8]. Every state of a marked Petri net in
execution under the maximum firing rule belongs to exactly one of
these classes. The class shows which transitions can fire and how
(sequentially, disjunctively or concurrently) they can fire.
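
The metrics and the classification can be sketched in a few lines. The sketch below assumes that C(S) is the total number of transitions over all elements of M_f(S) and that D(S) is the number of elements, which agrees with the worked example (C = 4, D = 2, CDR = 2). The class boundaries follow one reading of Table-1 and should be taken as an assumption; the function names are introduced here for illustration.

```python
def concurrency_metrics(m_f):
    """Return (C, D, CDR) for maximal-firings given as a list of sets."""
    c = sum(len(x) for x in m_f)      # degree of concurrency
    d = len(m_f)                      # degree of decision
    cdr = c / d if d else 0.0         # concurrency to decision ratio
    return c, d, cdr


def classify(m_f):
    """Map a state to class 0..4 from its maximal-firings."""
    c, d, _ = concurrency_metrics(m_f)
    if d == 0:
        return 0                      # no action
    if d == 1:
        return 1 if c == 1 else 3     # sequential or concurrent, deterministic
    return 2 if c == d else 4         # all singletons, or mixed


example = [{"t1", "t2"}, {"t3", "t4"}]
metrics = concurrency_metrics(example)
```

For the example above the result is (4, 2, 2.0) and the state falls into class-4, since there is both a choice between two sets and concurrency inside each set.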


3. PARALLEL EXECUTION SCHEMES


Three parallel execution schemes are developed and a 
module construction algorithm for each of them is presented. 
We use the classes of states for describing those construction
algorithms, because each state of a Petri net in execution can
be identified by its class. First, two partial states called the
firing state and the idle state are defined, and a state S is
decomposed into them.


Definition-3
For a given state S, the firing state and the idle state for
each X_k ∈ M_f(S) are defined as

    S_F^k = Σ_{t ∈ X_k} I(t) and S_I^k = S - S_F^k,

where each state is a bag of places.
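
As a minimal sketch of this decomposition, the Python below computes the firing state as the sum of the input bags of the transitions in X_k and the idle state as the bag difference. The function name `decompose` and the two-transition example are assumptions made for illustration; bags are again modelled with `collections.Counter`.

```python
from collections import Counter


def decompose(state, inputs, x):
    """Split state S into (firing state, idle state) for one element X."""
    s_f = sum((inputs[t] for t in x), Counter())   # tokens consumed by X
    s_i = state - s_f                              # tokens left untouched
    return s_f, s_i


# Hypothetical example: p3 holds two tokens that no firing transition uses.
inputs = {"t1": Counter({"p1": 1}), "t2": Counter({"p2": 1})}
state = Counter({"p1": 1, "p2": 1, "p3": 2})
s_f, s_i = decompose(state, inputs, {"t1", "t2"})
```

Here the firing state is {p1, p2} and the idle state is {2p3}; adding the two bags gives back S.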


As shown in Table-1, since there is no parallelism in
the states of class-1 and class-2, the execution schemes for
them are sequential or disjunctive. Class-4 is just the
combination of class-2 and class-3. Thus, class-1,
class-2 and class-4 are excluded from our discussion of parallel
execution schemes. From now on, we only consider class-3;
the procedures corresponding to class-1, class-2 and class-4
are described separately in Fig. A1.


Procedures for class-1, class-2 and class-4
Class-1 : (* M_f(S) = {{t}}, deterministic and sequential *)
   begin
      S_F = I(t) ; S_I = S - S_F ; execute the action t ;
      S' = O(t) + S_I ; invoke this algorithm for S' ;
   end
Class-2 : (* M_f(S) = {{t_1}, {t_2}, ..., {t_N}}, disjunctive and sequential *)
   for k = 1 to N
   begin
      S_F^k = I(t_k) ; S_I^k = S - S_F^k ; execute the action t_k ;
      S^k = O(t_k) + S_I^k ; invoke this algorithm for S^k
   endfor
Class-4 : (* M_f(S) = {X_1, ..., X_N}, disjunctive and concurrent *)
   for k = 1 to N
      follow the procedure Class-2 or Class-3 according to |X_k| ;
      (* The procedure Class-3 will be described in each scheme *)
   endfor


Fig. A1. The procedures for class-1, class-2 and class-4.


3.1 Lock-Step Scheme (LSS)

The LSS scheme is based on a lock-step manner with
single-step parallelism. The scheme for a state S belonging to
class-3 is as follows: 1) find M_f(S), 2) decompose S into the
firing state and the idle state, 3) execute all the actions in
X ∈ M_f(S) simultaneously and 4) generate a next state by
merging the resulting output states (O(t_k)) with the idle state.
If the next state belongs to class-3, this procedure is continued
with the next state; otherwise the procedures in Fig. A1 are
followed. Therefore, if the class-3 states are continued, the module
construction using the LSS scheme is represented as a task
σ_LSS by using a prefix language,

    σ_LSS = (· m_1 m_2 ... m_n),        (3-1)

and each m_i is

    m_i = (|| ω_1 ω_2 ... ω_N),         (3-2)


where the dot (·) and the double-bar (||) are used as control
operators representing sequential and concurrent operations
respectively. In (3-2), each ω_k is a single action. Thus, the
module construction with this scheme is represented by a
sequence of parbegin/parend blocks of actions. Suppose that a
multiprocessor consists of a single control unit (CU) and a
number of processing elements (PE's), interconnected
by an appropriate communication network. An advantage
of this kind of parallelism is that the module construction
for a task is easily accomplished in the CU, and thus its
allocation to the PE's is simple. However, this
scheme has a major defect: a synchronization must take
place whenever a module m_i is completed, before the next
module m_{i+1} is allocated to the PE's by the CU, so the number
of synchronizations is large. Moreover, in constructing modules
for parallel execution with this scheme, the completion of each
module m_i takes the maximum time among its parallel
actions, i.e. T(m_i) = max{T(ω_k)} + α, where α is the time
penalty required for a synchronization. The control overhead of
parallel execution for a task with this scheme is therefore high if
the class-3 states are continued.
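
The timing relation T(m_i) = max{T(ω_k)} + α can be turned into a short calculation. The sketch below is an illustration under stated assumptions: the penalty α is charged once per module (including single-action modules), modules run strictly one after another, and the action times are hypothetical numbers, not measurements.

```python
def module_time(action_times, alpha):
    """T(m_i): slowest parallel action plus one synchronization penalty."""
    return max(action_times) + alpha


def lss_task_time(modules, alpha):
    """Total time of a lock-step task: modules execute one after another."""
    return sum(module_time(times, alpha) for times in modules)


# Hypothetical task: a 2-action parallel module followed by a single action.
modules = [[2.0, 5.0], [3.0]]
total = lss_task_time(modules, alpha=1.0)
overhead = 1.0 * len(modules)   # time spent only on synchronization
```

In this example the total is 10.0 time units, of which 2.0 are synchronization penalties, and the action that takes 2.0 units leaves its PE idle for 3.0 units while the slower action completes.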




Now, we present two other execution schemes for reducing 
the number of synchronizations. 


3.2 Partial State Branching (PSB) Scheme

The PSB scheme, in contrast with the LSS scheme,
exploits asynchronous parallelism in depth to reduce the
overhead caused by the large number of synchronizations in the LSS
scheme. The basic idea is that, when concurrent actions are executed
in parallel, the local output state O(t_k) of each action is
branched independently and M_f is computed in each of the
local output states. The same procedure is repeated until each
state is terminated (class-0) or duplicated. An algorithm for
constructing modules with this scheme is shown in Fig. A2.


<ALGORITHM PSB>
(* HLIST and QLIST are global data structures *)
(* and they are initialized to be empty *)
(* The HLIST is a buffer for storing those states already examined, *)
(* and the QLIST is a fifo queue storing those states to be examined *)
Step 1: if (S ∈ HLIST) then return ; (* duplicated state is excluded *)
        else save S into HLIST ; (* for duplication check *)
Step 2: Find the M_f(S) and the CLASS of S ;
Step 3: case CLASS ;
   Class-0 : break ; (* terminated state *)
   (* refer to Fig. A1 for class-1, class-2 and class-4 *)
   Class-3 : (* M_f(S) = {{t_1, ..., t_N}} : deterministic-concurrent *)
      begin
         (* Parallel module, use double bars *)
         S_F = Σ_{k=1}^{N} I(t_k) ; S_I = S - S_F ;
         for k = 1 to N
         begin
            execute the action t_k ; S^k = O(t_k) ;
            invoke this algorithm for S^k ;
         endfor
      endclass
   endcase
   S' = (Σ_{k=1}^{N} S^k) + S_I ;
   invoke this algorithm for S'
endPSB.


Fig. A2. A parallel module construction algorithm with PSB scheme.


If we also assume that the class-3 states are continued, then the
module construction using the PSB scheme is described
by a task σ_PSB,

    σ_PSB = (· m_1 m_2 ... m_n),        (3-3)

and each m_i is also expressed by

    m_i = (|| ω_1 ω_2 ... ω_N),         (3-4)

but here ω_k = a_k is a single action, or ω_k = (· a_1 a_2 ... a_p) is a
sequential module consisting of actions, or ω_k = (|| a_1 a_2 ... a_q)
is a parallel module consisting of actions. The
difference between the modules constructed by the LSS and the
PSB can be seen from (3-2) and (3-4). The ω_k in (3-2)
must be a single action, while the ω_k of the PSB scheme may
be a sequence of actions or a parallel module as described by
(3-4). This means that the asynchronousity is higher than that
of the LSS scheme, i.e., |ω_k|_LSS ≤ |ω_k|_PSB. However,
subparallelism may occur in the PSB scheme,
because each ω_k can be a parallel module and thus each a_j
can also be a parallel module. For example, the following
task described by σ has such a subparallelism.

    σ = (· (|| (· a (|| (· c e) (· d f)) h) (· b g)) i)

When subparallelism occurs, the number of PE's decides
whether the subparallelism will be serialized or not.
If it is less than the maximum edge-cut (3 in this case), some
parallelism will be lost. However, if it is at least equal to
three, the maximal parallelism can be achieved.
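
The number of PE's needed to avoid serializing nested parallel modules can be estimated directly from a prefix expression. The sketch below is an assumption-laden illustration: it interprets the required width as 1 for an action, the maximum over the parts of a sequential module, and the sum over the parts of a parallel module, which gives an upper bound on the number of simultaneously active branches. Expressions are written as nested Python lists with "-" standing for the serial operator; the sample expression is hypothetical.

```python
def required_pes(node):
    """Upper bound on PE's needed so that no parallel branch is serialized."""
    if isinstance(node, str):
        return 1                                  # a single action
    op, *parts = node
    widths = [required_pes(p) for p in parts]
    if op == "-":
        return max(widths)                        # parts run one at a time
    if op == "||":
        return sum(widths)                        # parts run side by side
    raise ValueError(f"unknown operator: {op}")


# Hypothetical task with a parallel module nested inside a parallel module.
task = ["-", ["||", ["-", "a", ["||", "c", "d"]], "b"], "e"]
width = required_pes(task)
```

For this task the inner parallel module needs two PE's and its sibling action a third, so the width is 3. With fewer than three PE's, part of the nested parallelism has to be executed sequentially.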


3.3 Extended Partial State Branching (XPSB) Scheme

This is an extension of the PSB for higher asynchronousity
than the PSB scheme. In the PSB scheme, all states
terminated locally are merged regardless of the possibility of
executing another action in each terminated state (see Fig. A2).
Such a possibility can appear when a locally
terminated state is joined with a part of S_I. Hence, in the XPSB
scheme, when all states are terminated locally during the PSB
scheme, S_I is distributed to those terminated states and further
executable actions are explored. If some actions are executable
in a state joined with a part of S_I, they are executed as well.
More precisely, during the PSB algorithm (Fig. A2), a part of
the idle state is distributed to those terminated states that
have executable actions when they are joined with a part of S_I.
The idle state changes whenever this kind of distribution occurs.
When no terminated state is executable any more, even if
the idle state is partially or totally joined, a next state is
generated by merging those terminated states with the changed idle
state (if it exists). When the same part of S_I, denoted by P(S_I),
is distributed to more than one terminated state, the following
conflicts can occur.


Conflicts : the executable actions are 1) the same or 2) different.

The first-come/first-fit strategy can be used for the first case,
because the executable actions are the same. For the
second case, however, that P(S_I) is not considered for the
distribution; instead, another P(S_I) is considered for distribution
to those states, or other terminated states are considered for the
distribution of the P(S_I). If this scheme is used for exploiting
parallelism, higher asynchronousity than the PSB scheme can be
obtained and thus fewer synchronizations are needed, because
|ω_k|_PSB ≤ |ω_k|_XPSB. Since subparallelism can also occur in this
scheme, the same concept as in the PSB is applied. Therefore,
the module construction using this scheme has the same
description as (3-3) and (3-4), but the
task sizes are different. The construction algorithm using this
scheme is given in Fig. A3.


Let σ_LSS, σ_PSB and σ_XPSB be tasks obtained by using the
LSS, PSB and XPSB schemes, respectively, and let N_s(σ) be
the number of synchronizations occurring in the task σ. Then
we can find that


    N_s(σ_LSS) ≥ N_s(σ_PSB) ≥ N_s(σ_XPSB).


The number of synchronizations appears as an overhead that
degrades the performance of a task in a
parallel processing environment using those parallel execution
schemes. The smaller the total synchronization overhead, the
better the performance of the system.


3.4 Parallel Behavior Representation


A labeled directed graph called an AND/OR reachability
graph is used for representing the parallel execution of a task.
It is a graph G = (V, E, L), where V is a set
of nodes, E ⊆ V × V is a set of directed edges and L: E → T
is an edge labeling function assigning a transition to each edge.
Each node represents a state and each directed edge
is labeled by an action. The graph is a reachability graph of
a marked Petri net under the maximum firing rule, so
it must be able to describe both parallel and disjunctive
behaviors. Thus, as mentioned before, the double-bar (||) and plus
(+) are used for indicating a parallel module and a disjunctive
module respectively. The former is assigned to the class-3
nodes and the latter is assigned to the class-2 nodes. Both of
them are used for class-4 nodes because these have the
operational properties of both class-2 and class-3 nodes.


<ALGORITHM XPSB>
Step 1: if (S ∈ HLIST) then return ; (* duplicated state is excluded *)
        else save S into HLIST ; (* for duplication check *)
Step 2: Find the M_f(S) and the CLASS of S ;
Step 3: case CLASS
   Class-0 : break ;
   (* refer to Fig. A1 for class-1, class-2 and class-4 *)
   Class-3 : (* M_f(S) = {{t_1, ..., t_N}} *)
      begin
         S_F = Σ_{k=1}^{N} I(t_k) ; S_I = S - S_F ;
         for k = 1 to N (* Start PSB *)
         begin
            execute the action t_k ; S^k = O(t_k) ;
            invoke this algorithm for S^k ;
         endfor
      endclass
   endcase
   (* all the states are terminated and thus S_I is distributed *)
   for each S^k
      if S^k can be executable when a P(S_I) is joined
      then begin
         S^k = S^k + P(S_I) ; S_I = S_I - P(S_I) ;
         invoke this algorithm for S^k
      end
   endfor
   (* there are no more executable states *)
   S' = (Σ_{k=1}^{N} S^k) + S_I ; (* a next state *)
   invoke this algorithm for S'
endXPSB.


Fig. A3. A parallel module construction algorithm with XPSB scheme.


There are four branching rules prescribed for the
construction of an AND/OR reachability graph. They are the
sequential branching with a single action (SBS), sequential
branching with n multiple actions (SBM), disjunctive branching
(DB) of n actions and parallel branching (PB) of n actions.
Each of the branching rules is represented graphically as shown
in Fig. 2 by using the concurrency and disjunction marks (||
and +). They are described as follows:


(1) Sequential Branching with a Single action (SBS)
      (S = S_F + S_I) --t--> (S' = O(t) + S_I), where S_F = I(t)

(2) Sequential Branching with n Multiple actions (SBM)
      (S = S_F + S_I) --t_1,...,t_n--> (S' = Σ_{k=1}^{n} O(t_k) + S_I),
      where S_F = Σ_{k=1}^{n} I(t_k) and use the mark ||.

(3) Disjunctive Branching (DB) of n actions
      for k = 1 to n
         (S = S_F^k + S_I^k) --t_k--> (S^k = O(t_k) + S_I^k),
         where S_F^k = I(t_k)
      endfor (use the mark +)

(4) Parallel Branching (PB) of n actions
      for k = 1 to n
         (S = S_F + S_I) --t_k--> (S^k = O(t_k)),
         where S_F = Σ_{k=1}^{n} I(t_k)
      endfor (use the mark ||)


As we see, there are two different cases of sequential
branching, the SBS and the SBM, which are used for
class-1 nodes and class-3 nodes respectively. The SBM is the
synchronous parbegin/parend represented by single-step
parallelism, while the PB is an asynchronous parbegin/parend. An
important aspect is that the SBS, SBM and DB generate
global states as next states, while the PB generates partial
states as next states. Therefore, the PB is
necessary to describe the PSB and the XPSB, but it is
not required for the description of the LSS scheme.

The following elimination rule is also used to decide
whether a node must be branched or not.

Elimination rule :
If a state is terminated or duplicated, no branching is
carried out.


The termination is detected by checking whether the state 
belongs to the class-0 or not and the duplication is detected by 
checking a queue (HLIST) which stores all the states generated 
before. Therefore, the elimination rule can be implemented by 
the procedure checking the class of a state and the queue. 


The maximal-firings is computed in the current state and
characterized by one of the five classes, so that the AND/OR
reachability graph is constructed dynamically by the
enumeration of classes of states and actions together with several data.
Thus, the graph corresponding to each of the parallel
execution schemes (the LSS, the PSB and the XPSB) can be
constructed by using the algorithms given in Fig. A1, A2
and A3. We have used the depth-first method for their
construction.


nets, various control modules have been developed for
representing software structures [3], [9]. We classify them into
four control modules: a begin-end module, an if-then-else
module, a do-while module and a parbegin-parend module.
Each of those control modules can be expressed by a Petri net
language. To represent the PN language as a prefix language,
four control operators with the same precedence are used. They
are the serial operator (·), the disjunction operator (+), the
concurrency operator (||) and the loop operator (*), and they
are used for the four control modules respectively. Note that
the do-while module is regarded as a loop of several actions
or modules with a single loop counter lc, and we assume that
the iteration is finite. For example, a PN language (a · (b +
(c || d)) · e · f)* is represented by the prefix notation (* (· a
(+ b (|| c d)) e f) lc), where lc is a loop counter.
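
The correspondence between the prefix notation and the infix PN language can be checked mechanically. The sketch below is illustrative only: the tokenizer, the nested-list representation and the function names are assumptions made here, and the ASCII character "-" stands in for the serial operator.

```python
def tokenize(text):
    """Split a prefix expression into parentheses, operators and names."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build a nested list [operator, part, ...] from a token list."""
    token = tokens.pop(0)
    if token != "(":
        return token                      # an action or the loop counter
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)                         # discard the closing parenthesis
    return node


def to_infix(node):
    """Render a parsed prefix expression in infix form."""
    if isinstance(node, str):
        return node
    op, *parts = node
    if op == "*":
        return to_infix(parts[0]) + "*"   # parts[1] is the loop counter
    return "(" + f" {op} ".join(to_infix(p) for p in parts) + ")"


tree = parse(tokenize("(* (- a (+ b (|| c d)) e f) lc)"))
infix = to_infix(tree)
```

Applied to the example from the text, the parser yields a loop node whose body is a serial module of four parts, and `to_infix` gives back `(a - (b + (c || d)) - e - f)*`.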

A 4-step process is proposed for realizing an architectural
software.


(Step-1) Construct an AND/OR reachability graph describing a
         parallel execution scheme.

(Step-2) Get the Petri net language described by a prefix
         notation.

(Step-3) Decompose the language and generate a hierarchical
         relation among the decomposed modules.

(Step-4) Generate a software structure with the hierarchical
         relation.


Steps (1) and (2) can be carried out by using the
algorithms given in Fig. A1, A2 and A3. Step (3)
decomposes a Petri net language and generates a hierarchical
relation among the decomposed modules, represented by a tree
called a fork tree. The decomposition of a language gives us
a decomposition of the task. The root node (module (0, 0)) of
a fork tree is just the given language. In Step (4) a
hierarchically structured program for the task described by a Petri net
language is generated by using the four control modules.


5. AN ILLUSTRATIVE EXAMPLE


An example illustrating each parallel execution scheme is
given for a Petri net model, and the performance of the
schemes is compared. The Petri net shown in Fig. 3 is
the model of a simple CPU which consists of three operation
cycles under the maximum firing rule. They are an arithmetic
cycle (AC), a store cycle (SC) and a branch cycle (BC). All
of them start from the instruction decoding action a and are
classified by that action. We exclude the branch cycle because
there is no parallelism in it. The two task-cycles, which
are constructed by applying the LSS, PSB and XPSB schemes
to AC and SC, are represented by Petri net languages in
prefix notation as follows:

For the LSS scheme,

    AC_LSS  = (· a b_1 c (|| f l) g_1 i n)
    SC_LSS  = (· a b_1 d (|| g_1 l) j n)

For the PSB scheme,

    AC_PSB  = (· a b_1 c (|| (· f g_1 i) l) n)
    SC_PSB  = (· a b_1 d (|| g_1 l) j n)

For the XPSB scheme,

    AC_XPSB = (· a b_1 c (|| (· f g_1 i) l) n)
    SC_XPSB = (· a b_1 d (|| (· g_1 j) l) n)


Consider the arithmetic cycle (AC) for the comparison between
the LSS and the PSB schemes. The task-cycle
AC_LSS consists of a 7-module sequence with one parallel
module, in which the two actions f and l are executed
simultaneously. The task-cycle AC_PSB consists of a
5-module sequence, in which the module (· f g_1 i) and the action
l can be executed simultaneously. Therefore the
asynchronousity of the PSB is higher than that of the LSS. Furthermore,
the execution according to the PSB is equal to or faster than
that according to the LSS when they are implemented in a
2-processor system, because the execution of the module (· (|| f
l) g_1 i) takes at least as long as that of the parallel module (||
(· f g_1 i) l).

Consider the store cycle (SC) for the comparison between
the PSB and XPSB schemes. The task-cycle SC_PSB consists
of a 6-module sequence, in which the two actions g_1 and l can
be executed in parallel, while the task-cycle SC_XPSB consists
of a 5-module sequence, in which the module (· g_1 j) and the
action l can be executed in parallel. The asynchronousity of
the XPSB is higher than that of the PSB. In the same sense,
the execution according to the XPSB is equal to or faster than
that according to the PSB when they are implemented in a
2-processor system. Moreover, if the execution of the action l is longer
than that of the module (· f g_1 i), the PSB scheme is much
better than the LSS. Also, if the execution of the action l is
longer than that of the module (· g_1 j), the XPSB is much
better than the PSB.
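
The comparison for the arithmetic cycle can be checked numerically with a simple timing model. The sketch below rests on several assumptions: a sequential module takes the sum of the times of its parts, a parallel module takes the maximum (enough processors are available, which holds for two processors here), synchronization penalties are ignored, and all action times are hypothetical. The two expressions use the modules named in the text, with "-" for the serial operator and the action names written without subscripts.

```python
def exec_time(node, times):
    """Execution time of a prefix expression given per-action times."""
    if isinstance(node, str):
        return times[node]
    op, *parts = node
    durations = [exec_time(p, times) for p in parts]
    if op == "-":
        return sum(durations)             # sequential module
    if op == "||":
        return max(durations)             # parallel module, no penalty
    raise ValueError(f"unknown operator: {op}")


ac_lss = ["-", "a", "b", "c", ["||", "f", "l"], "g", "i", "n"]
ac_psb = ["-", "a", "b", "c", ["||", ["-", "f", "g", "i"], "l"], "n"]

unit = {name: 1.0 for name in "abcfgiln"}
slow_l = dict(unit, l=4.0)                # hypothetical: action l is slow

equal_case = (exec_time(ac_lss, unit), exec_time(ac_psb, unit))
slow_case = (exec_time(ac_lss, slow_l), exec_time(ac_psb, slow_l))
```

With unit times both schemes need 7.0 time units. When l takes 4.0 units, the LSS form needs 10.0 while the PSB form needs 8.0, because g and i overlap with l instead of waiting for it. This matches the claim that the PSB is equal to or faster than the LSS and gains most when l is longer than the module it runs beside.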


From this example, we can see that the PSB scheme is
more efficient than the LSS, and the XPSB more efficient than the
PSB, due to higher asynchronousity. With the enhanced
asynchronousity of the PSB and the XPSB schemes, the timing overhead for
task control in a multiprocessor is reduced relative to the LSS
scheme. Each of the three parallel execution schemes describes the
parallelism of a particular kind of computer architecture. For example,
the LSS scheme describes synchronized parallelism in a
lock-step manner, and thus it is better suited
to SIMD machines such as array processors,
while the PSB and the XPSB schemes suit a message-based
distributed system or a loosely-coupled multiprocessor system
(especially a tree-structured machine), because the high asynchronousity
among modules that can be processed in parallel reduces the overhead
for the task control.


6. CONCLUSION 


We have presented three parallel execution schemes in a
Petri net and discussed the differences between them. A module
construction algorithm for each of them is given and
implemented in the functional language Lisp. The main consideration is
to increase the asynchronous activity in order to reduce the number
of synchronizations when a task is executed in a multiprocessor.
The extent of asynchronousity differs with the scheme employed,
but the parallelism is kept maximal, because the
execution of a Petri net follows the maximum firing rule,
characterized by simultaneously executing the maximal-firings in each
state. We have shown that the timing overhead for the control
of a task in a multiprocessor is reduced by increasing the
asynchronousity. The schemes are followed by the hierarchical
software construction which represents the dynamic behavior of
a task. The overall procedure can be automated by a 4-step
process. This is useful in the early stage of software
development for a multiprocessor.

Fig. 1. A marked Petri net with S_0 = {3p_4}.


Table-1. The state characterization by C(S) and D(S)

Class  M_f(S)                        C(S) and D(S)      Characterization
0      ∅                             C(S) = D(S) = 0    no_action
1      {{t}}                         C(S) = D(S) = 1    deterministic-sequential
2      {{t_1}, {t_2}, ..., {t_N}}    C(S) = D(S) ≥ 2    disjunctive-sequential
3      {{t_1, t_2, ..., t_n}}        C(S) > D(S) = 1    deterministic-concurrent
4      {X_1, X_2, ..., X_N},         C(S) > D(S) ≥ 2    disjunctive-concurrent
       where X_k = {t_1, ..., t_n}

[Fig. 2 panels: (a) Sequential Branching with a Single action (SBS);
(b) Sequential Branching with n Multiple actions (SBM);
(c) Disjunctive Branching (DB) of n actions;
(d) Parallel Branching (PB) of n actions]

Fig. 2. Graphical representation of four branching rules.

Fig. 3. A Petri net model of a simple CPU.


Abstract. The HITACHI supercomputer S-820
has been developed as Hitachi's top-end
supercomputer. It is also rated as one of the most
powerful supercomputers in the world.

To achieve the performance goal, the S-820
employs advanced vector execution control. Of the
features of the S-820's vector processing, this
paper discusses parallelism between scalar and
vector processing, elementwise parallel execution
and instruction stacking.

Parallelism between scalar and vector processing
greatly speeds up processing of short vectors.

Elementwise parallel execution greatly speeds up
calculations which have few terms.

Instruction stacking greatly speeds up
processing of short vectors and increases the efficiency
of elementwise parallel execution.


1. Introduction 


With the progress of science and technology,
demand is increasing for processing large-scale, 
complex data in many fields, such as structural 
analysis, molecular science, nuclear fusion, 
semiconductors and natural resource exploration. 

Many supercomputers capable of processing large 
amounts of data as vectors have been developed to 
meet such growing demand. Their application is 
expanding to new fields such as computational 
experiments and large-scale simulations, which 
typically require more computation power and 
larger data storage capacities than other appli- 
cations. 

The HITACHI supercomputer S-820 has been devel- 
oped to meet these systems requirements based on 
state-of-the-art LSI technology. This paper 
introduces high-speed vector instruction execu- 
tion schemes of the S-820. 

These high-speed vector instruction execution 
schemes of the S-820 have been developed aiming 
at three goals described as follows. 


(1) Performance enhancement of short vector 
calculations 

Before vector calculations start, preprocessing
for the vector calculation, such as
generating the addresses of vectors, is executed.
The time required for this preprocessing
is independent of vector length, while the vector
calculation time is proportional to vector length.
The preprocessing therefore accounts for a large
share of the total time for short vectors. For
long vectors, in contrast, the time required for the
preprocessing becomes negligible compared with the
vector calculation time.

Thus the preprocessing acts as a large start-up
time and degrades the performance of short vector
calculations.

Therefore the performance of not only long vector
calculations but also short vector calculations
should be enhanced.
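
The effect of a fixed start-up time can be seen with a one-line model. The sketch below assumes that the total time is a constant preprocessing time plus a per-element time multiplied by the vector length; the numeric values are hypothetical and are not S-820 figures.

```python
def startup_fraction(t_pre, t_elem, length):
    """Share of total time spent in preprocessing for one vector operation."""
    return t_pre / (t_pre + t_elem * length)


# Hypothetical timings: 100 time units of preprocessing, 1 unit per element.
short_vector = startup_fraction(100.0, 1.0, 10)
long_vector = startup_fraction(100.0, 1.0, 10_000)
```

Under these assumed numbers, preprocessing takes about 91% of the time for a vector of 10 elements but only about 1% for a vector of 10,000 elements, which is why the start-up time matters mainly for short vectors.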


(2) Performance enhancement of calculations which
    consist of few terms

The peak performance of a supercomputer is
determined as follows.

             Number of arithmetic units
    MFLOPS = ----------------------------        (1)
             Pipeline period (micro sec)

To increase vector processing performance, the
processor usually must have as many arithmetic
units that can operate in parallel as possible.

However, in calculations which consist of few
terms, such as a single addition of two vectors,
many arithmetic units are idle while the
calculation is in execution, since only a few
(i.e. not all) of them are needed for the calculation.

As a result, the performance of the processor
falls far below its peak performance.

Therefore the performance of calculations which
consist of few terms should be enhanced.
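
Equation (1) and the loss caused by idle units can be illustrated with a short calculation. The sketch below is an assumption-based example: the unit count and pipeline period are hypothetical values chosen for round numbers, not the specification of the S-820, and the achieved rate is modelled simply as the busy units divided by the pipeline period.

```python
def peak_mflops(units, period_us):
    """Equation (1): arithmetic units divided by pipeline period in microseconds."""
    return units / period_us


def achieved_mflops(busy_units, period_us):
    """Rate when only some of the arithmetic units have work to do."""
    return busy_units / period_us


# Hypothetical machine: 8 arithmetic units, 0.005 microsecond pipeline period.
peak = peak_mflops(8, 0.005)
single_add = achieved_mflops(1, 0.005)    # one vector addition keeps one unit busy
utilization = single_add / peak
```

For these assumed values the peak is 1600 MFLOPS, while a calculation that occupies a single unit reaches 200 MFLOPS, that is, one eighth of the peak.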


(3) Reduction of loss time caused by instruction
    switching

When the execution of the instruction in an
arithmetic unit ends and the next instruction is
given to it, loss time is caused by instruction
switching. This loss time has a greater effect
on the performance of the processor as the vector
calculation time becomes shorter.

Therefore this switching time should be reduced.

To achieve the first goal (1), the highly paral- 
lelized processing between the scalar processor 
and the vector processor has been realized. 

This scheme is presented in section 6.1. 

To achieve the second goal (2), the elementwise 
parallel processing scheme has been developed. 

This scheme is presented in section 6.2. 

To achieve the third goal (3), the instruction 
stacking scheme has been developed. 

This scheme is presented in section 6.3. 


2. Architecture 


The architecture of the S-820 is shown in Fig.1. 
The S-820 consists of the scalar processor and the 
vector processor. The architecture of the scalar 
processor is compatible with Hitachi's general 
purpose computer M-series systems. Thus the 
S-820 supports standard data processing environ- 
ments, such as TSS(Time Sharing System) and RJE 
(Remote Job Entry). 

The architecture of the vector processor includes
90 vector instructions, 32 sets of vector
registers, 16 sets of vector mask registers, 32 sets of
scalar registers, 48 sets of vector address
registers, the vector address translation feature and the
vector processing timer.


M-series Architecture                  Extended Vector Architecture

M-series instructions       216        Vector instructions          90
General-purpose registers    16        Vector registers             32
Floating-point registers     16        Vector mask registers        16
Control registers            16        Scalar registers             32
Address translation                    Vector address registers     48
Storage protection                     Vector address translation
TOD clock                              Vector processing timer
CPU timer
Clock comparator
Extended channel system


Fig.1 S-820 Architecture


3. Processor Organization


The S-820 is available in two models:
    S-820 model 80 (S-820/80)
    S-820 model 60 (S-820/60).

The S-820/80 is twice as powerful as the S-820/60
and has a larger maximum storage capacity.

The S-820 consists of the vector processor, the scalar
processor, the main storage, the extended storage,
the input/output processor(s) and the service
processor.

Fig.2 shows the processor organization. Table 1
shows the specification of the S-820.


[Fig.2 block labels: Instruction Processor, VP, Add/Logical, SVP, IOP]

MS  : Main Storage            BS  : Buffer Storage
ES  : Extended Storage        FPR : Floating-Point Register
SC  : Storage Controller      GR  : General Registers
SVP : Service Processor       SR  : Scalar Register
IOP : I/O Processor           VAR : Vector Address Register
                              VR  : Vector Register
                              VMR : Vector Mask Register


Fig.2 Processor Organization

Table 1 Specification of S-820

Item                             Specification
Registers                        32 x 256 Words
Vector Processing Timer          Supported
Main Storage Error Checking      1 Bit Error Correction
                                 and 2 Bit Error Detection


4. Outline of Vector Processing


Like the predecessor S-810[1], the S-820 features 
parallel operations between scalar and vector 
processing, as illustrated below. 


For example, a DO loop in a FORTRAN program:

      DO 20 I = 1, N
   20 A(I) = B(I) * C(I)

is compiled into a chain of vector instructions:

      VLD   VR0, B
      VLD   VR2, C
      VEMD  VR4, VR0, VR2
      VSTD  VR4, A

where
      VLD  : Vector Load
      VEMD : Vector Elementwise Multiply
      VSTD : Vector Store
      VRi  : i-th Vector Register


A(I), B(I) and C(I) are vectors. The execution
of this program is shown in Fig.3. The scalar
processor (SP) performs preprocessing for this chain
of instructions and passes control to the vector
processor (VP).

After this preprocessing, the SP issues an EXVP 
(Execute Vector Processing) instruction. The VP 
begins vector processing, i.e., fetches the vector 
instruction chain, executes a VLD instruction to 
load vector B to vector register 0, a VLD instruc- 
tion to load vector C to vector register 2, a 
VEMD instruction to multiply vector B by vector C 
and to store the result in vector register 4 and 
a VSTD instruction to store the result held in 
vector register 4 in vector A. 
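
The effect of the four-instruction chain can be mimicked with a toy interpreter. The sketch below is purely illustrative: it models the semantics of VLD, VEMD and VSTD as described in the text on Python lists, executes the instructions one after another, and says nothing about how the S-820 hardware pipelines them. The function name and the tuple encoding of instructions are assumptions.

```python
def run_chain(chain, memory):
    """Execute a chain of (opcode, operands...) tuples against named vectors."""
    vr = {}                                        # vector registers by name
    for op, *args in chain:
        if op == "VLD":                            # vector load
            reg, name = args
            vr[reg] = list(memory[name])
        elif op == "VEMD":                         # elementwise multiply
            dst, src1, src2 = args
            vr[dst] = [x * y for x, y in zip(vr[src1], vr[src2])]
        elif op == "VSTD":                         # vector store
            reg, name = args
            memory[name] = list(vr[reg])
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return memory


chain = [
    ("VLD", "VR0", "B"),
    ("VLD", "VR2", "C"),
    ("VEMD", "VR4", "VR0", "VR2"),
    ("VSTD", "VR4", "A"),
]
memory = {"B": [1.0, 2.0, 3.0], "C": [4.0, 5.0, 6.0]}
result = run_chain(chain, memory)
```

After the chain has run, vector A holds the elementwise products of B and C, which is what the DO loop computes.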

During the vector processing by the VP, the SP
can execute other scalar processing or
preprocessing for the next vector processing. Parallel
processing between the SP and the VP shortens the total
program execution time.
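
The gain from overlapping the SP and the VP can be estimated with a small scheduling model. The sketch below assumes that the SP issues an EXVP instruction as soon as its preprocessing is done and the VP is free, and that it starts preprocessing the next chain immediately afterwards. The job durations are hypothetical, and the function names are introduced here.

```python
def serial_time(jobs):
    """Total time when preprocessing and vector processing never overlap."""
    return sum(pre + vec for pre, vec in jobs)


def overlapped_time(jobs):
    """Total time when the SP preprocesses the next chain while the VP runs."""
    sp_free = 0.0                         # time at which the SP can start work
    vp_free = 0.0                         # time at which the VP becomes idle
    for pre, vec in jobs:
        ready = sp_free + pre             # preprocessing of this chain is done
        start = max(ready, vp_free)       # EXVP waits until the VP is ready
        vp_free = start + vec
        sp_free = start                   # SP continues right after EXVP
    return vp_free


# Hypothetical workload: three chains, each with 2 units of preprocessing
# and 5 units of vector processing.
jobs = [(2.0, 5.0), (2.0, 5.0), (2.0, 5.0)]
times = (serial_time(jobs), overlapped_time(jobs))
```

For this workload the serial schedule takes 21.0 units and the overlapped schedule 17.0 units, since only the first preprocessing step is not hidden behind vector processing.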


[Fig.3 timeline: the SP performs preprocessing for vector processing and
issues an EXVP instruction; while the VP executes the vector instruction
chain (VLD VR0,B; VLD VR2,C; VEMD VR4,VR0,VR2; VSTD VR4,A), the SP
performs scalar processing or preprocessing for the next vector
instruction chain and then issues the next EXVP instruction.]


Fig.3 Parallel Operation between
      Scalar and Vector Processing


5. Outline of Vector Instruction Execution Control 
5.1 Structure of Vector Instruction Control Unit 


Fig.4 shows the structure of the vector instruc- 
tion control unit of the S-820. 


[Fig.4 blocks: Main Storage, Vector Instruction Fetch Part,
Instruction Decode Part, Register Conflict Check Part,
Resource Conflict Check Part, Instruction Issue Part,
Instruction Execution Control Part (IECP), Memory Requester 0,
Memory Requester 1, Adder, Multiplier, Vector Register Control Part]


Fig.4 Structure of Vector Instruction Control Unit


The main storage stores instructions and data op-
erands. The vector instruction fetch part fetches
a chain of vector instructions from the main stor-
age when the SP issues an EXVP instruction and
then it sends vector instructions in a chain to
the instruction decode part.

The instruction decode part puts the vector in-
structions in the queue and decodes them.

The register conflict check part checks whether 
the destination register of the current vector in- 
struction becomes a source for the next vector in- 
struction, and has the information about the sta- 
tus of each vector register. 

The instruction issue part sends the decoded
vector instruction to the instruction execution
control part.

Memory requester 0, memory requester 1, adder and 
multiplier are called "resources". 

These resources are explained briefly as follows.
A memory requester sends main storage access re-
quests to the storage controller. The adder
executes addition on operands. The multiplier
executes multiplication on operands. Each resource
processes four vector elements in parallel (see
section 6.2).

These four resources must execute different vec- 
tor instructions in a pipeline manner while main- 
taining the correct sequence of access (storing 
and referencing) of vector registers. 

The resource conflict check part checks the con- 
flict in use of any of the resources between the 
current vector instruction and the next vector in- 
struction, and has the information about the sta- 
tus of each resource. 

The instruction execution control part, which is 
attached to each resource, controls the execution 
of the vector instruction in each resource. 

The vector register control part controls data 
transfer to and from the vector registers. 


5.2 Vector Instruction Execution Process 


In this section the vector instruction execution 
process of the S-820 is explained by means of 
Fig.4. 

As mentioned in section 4, vector instructions 
are started by an EXVP instruction, which is a 
scalar instruction. The SP senses the condition 
of the VP and starts the VP when the VP is ready 
for execution of vector instructions. 

The vector instruction fetch part sends the vec- 
tor instructions to the instruction decode part at 
a rate of one instruction per machine cycle. 

At first the vector instruction which is sent 
from the vector instruction fetch part is decoded 
in the instruction decode part, that is, the vec- 
tor instruction is decoded into op-code, vector 
length, vector register number, resource number 
and so forth. 

After decoding the vector instruction, the in- 
struction decode part sends the aforementioned 
information to the instruction issue part, sends 
the register number of the vector register des- 
ignated in the vector instruction to the register 
conflict check part and sends the resource number 
of the resource used in the vector instruction to 
the resource conflict check part. 

In the register conflict check part, conflict in
use between the vector registers used in the de-
coded vector instruction and the vector registers
used in the vector instruction in execution is
checked.

In the same manner conflict in the use of re- 
sources is checked in the resource conflict check 
part. 

When no conflict is detected, the instruction
issue part sends the information necessary for the
execution of the vector instruction, that is, op-
code, vector register number, mask information,
chaining information and other control information,
which is shown as data path S8 in Fig.4, and the
instruction start signal, which is shown as S7 in
Fig.4, to the instruction execution control part
corresponding to the resource which executes the
vector instruction.

When any conflict is detected, the instruction 
issue part waits until the conflicts disappear. 

Each instruction execution control part sends the 
instruction to the corresponding resource by way 
of one of the data paths S9, S10, S11 and S12 and 
instructs the resource to execute the instruction. 

The resources and the vector register control
part execute the instruction sent from the in-
struction execution control part.

The vector register control part reads the vec- 
tor data operands from the vector register desig- 
nated in the instruction and sends them to the 
resource which needs them. 

The vector register control part also writes the 
vector data which are sent from the resources to 
the vector register designated in the instruction. 
When the vector register control part ends read- 
ing/writing, it informs the resource and the in- 
struction control part by the signals S13, S14, 
S15 and S16 and also informs the register conflict 
check part. 

When a resource ends the operation, it informs
the resource conflict check part and updates the
resource status in the resource conflict check
part.

It should be noted that vector instructions exe- 
cuted by different resources can operate in paral- 
lel when no conflict is detected, since each re- 
source is independent of other resources. 
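
The issue decision described in this section can be sketched as follows. This is a deliberately simplified model, not the S-820 logic: the class name ConflictChecker and its methods are hypothetical, and the sketch treats any register overlap as a conflict, ignoring chaining and the instruction stacking of section 6.3.

```python
from dataclasses import dataclass, field

@dataclass
class ConflictChecker:
    """Hypothetical stand-in for the register and resource conflict check parts."""
    busy_registers: set = field(default_factory=set)
    busy_resources: set = field(default_factory=set)

    def can_issue(self, registers, resource):
        # A conflict exists if any named register or the resource is in use.
        reg_conflict = bool(self.busy_registers & set(registers))
        res_conflict = resource in self.busy_resources
        return not (reg_conflict or res_conflict)

    def issue(self, registers, resource):
        # The instruction issue part waits (returns False) while a conflict exists.
        if not self.can_issue(registers, resource):
            return False
        self.busy_registers |= set(registers)
        self.busy_resources.add(resource)
        return True

    def finish(self, registers, resource):
        # Completion frees the registers and the resource.
        self.busy_registers -= set(registers)
        self.busy_resources.discard(resource)

checker = ConflictChecker()
first = checker.issue({0}, "memory requester 0")     # load into VR0
second = checker.issue({2}, "memory requester 1")    # independent load into VR2
blocked = checker.issue({4, 0, 2}, "multiplier")     # VR0 and VR2 still busy
checker.finish({0}, "memory requester 0")
checker.finish({2}, "memory requester 1")
retry = checker.issue({4, 0, 2}, "multiplier")       # now free to issue
```

In this sketch the two loads issue in parallel because they use different registers and different resources, which is the independence noted above.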


6. Vector Instruction Execution Schemes 


6.1 Highly Parallelized Processing between the 
Scalar Processor and the Vector Processor 


As described in section 4, parallel processing 
between the SP and the VP effectively shortens the 
total program execution time. 

In order to further enhance performance the S-820 
employs the following two processing schemes: 

(1) Vector processing linking
    While the VP is executing a vector operation
    the execution of the next chain of vector
    instructions can be started.

(2) Vector processing signaling
    Even when the VP has not completed all the
    operations in the chain of the vector in-
    structions, the part of the scalar program
    which needs the result of a part of vector
    operations can be started, when it is ready.


Fig.5(a) shows parallel processing between the
SP and the VP described in section 4, taking an
example of a chain of vector instructions which
is divided into two vector operations.

First the SP executes preprocessing for vector
operations (1) and (2).

Next the SP starts the VP by means of EXVP in- 
struction. 

While the VP executes the vector operation, the 
SP can execute scalar processing such as preproc- 
essing of the next vector operations. 

As soon as current vector processing has com- 
pleted, the SP can start the VP for the next vec- 
tor operation which has been kept waiting. 


Fig.5(b) illustrates how vector processing link- 
ing improves the performance. 

In case that the control information for the vec-
tor operation (2) is not necessary for the vector
operation (1), the SP starts the VP immediately
after preparing the control information necessary
for the vector operation (1).

Next, the SP sets up the VP with the information 
necessary for the vector operation (2) and signals 
"linking operation" to the VP. 

The VP links the vector operation (2) with the
vector operation (1), i.e., executes the vector
operation (1) and the vector operation (2) in a
synchronized manner.

As compared with Fig.5(a), the degree of paral-
lelism is increased. As a result overall perfor-
mance is enhanced.

The compiler automatically partitions a chain of 
vector instructions into two or more smaller 
chains so that parts of them can be processed in 
an overlapped manner. 


Fig.5(c) illustrates how vector signaling en-
hances performance, taking an example of a chain
of vector instructions which is divided into two
vector instructions, where the first vector in-
struction generates a scalar result which is nec-
essary for following scalar processing.

In Fig.5(c), the vector instruction (1) is the
only instruction in a vector instruction chain
which stores the calculation result in a scalar
register. Such a vector instruction is given the
"signaling flag" by the compiler.

The vector instruction (1) stores the result in
the scalar register, which can then be accessed by
a subsequent scalar instruction (3).

The vector instruction (1) in such a case is
given the "signaling flag" as mentioned above.

When the vector instruction (1) finishes the op-
eration, the VP tells the SP to start the
subsequent scalar instruction (3).

Since the scalar instruction (3) does not wait
until the vector instruction (2), which immediately
follows the vector instruction (1), completes the
operation, the SP and the VP operate in parallel
after the completion of the operation of the vec-
tor instruction (1).

As a result overall performance is enhanced. 


The realization of these two schemes, vector
processing linking and vector processing signaling,
utilizing the vector instruction control unit shown
in Fig.4, is described as follows.


To realize vector linking, the vector instruction 
fetch part executes the following control. 

When an EXVP instruction without a "linking flag"
is issued by the SP, the vector instruction fetch
part senses the instruction decode part and the
instruction issue part. In case that there are
either instructions being executed or instructions
in the instruction decode part, it does not send
a chain of vector instructions to the instruction
decode part.

When an EXVP instruction with a "linking flag" is
issued by the SP, the vector instruction fetch
part sends a chain of vector instructions to the
instruction decode part in case that there is no
instruction in the instruction decode part.


Fig.5(a) Parallel Processing of the Scalar
Processor and the Vector Processor


To realize vector signaling, the vector register
control part executes the following control.

When the vector register control part ends writ-
ing the result in the scalar register which is
specified in the vector instruction with the
"signaling flag", it sets a specific register
called the signal register.

The SP senses this signal register and executes 
subsequent scalar instruction which needs the data 
held in the scalar register when it finds the 
signal register to be one. 

The "linking flag" and the "signaling flag" are
specified by the compiler.


6.2 Elementwise Parallel Processing Scheme


The elementwise parallel processing operation is
realized as follows.

The arithmetic unit of the S-820 model 80 con-
sists of four fully segmented pipelines. These
four pipelines are operated concurrently. Thus
memory requester 0, memory requester 1, adder and
multiplier in Fig.4 all consist of four fully seg-
mented pipelines. Therefore four elements can be
processed in one machine cycle, whereas a single
pipeline scheme would allow only one element to be
processed in one machine cycle.

Fig.6 shows elementwise parallel processing for
addition of two vectors in the vector register
NO.0 (VR0) and the vector register NO.4 (VR4). The
result is to be stored in the vector register NO.8
(VR8).

At first, elements NO.0, NO.1, NO.2 and NO.3 are
processed concurrently as follows. The adder in
Fig.4 consists of four fully segmented pipelines,
which are called Adder0, Adder1, Adder2 and Adder3.

The vector register control part reads these four
elements from VR0 and VR4 and sends them to these
four adder units, Adder0, Adder1, Adder2 and
Adder3, concurrently.

Next, the vector register control part sends the
executing signal to Adder0, Adder1, Adder2 and
Adder3 in parallel.

Finally, the vector register control part writes 
the four results of additions to VR8 in parallel. 

These procedures are repeated for further ele- 
ments. 

Thus four consecutive elements are processed con- 
currently as one unit. 
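
The grouping of four consecutive elements can be illustrated with a small sketch. The function elementwise_add and its step counter are assumptions for illustration; the step count ignores pipeline fill latency and only counts how many element groups are issued.

```python
def elementwise_add(vr0, vr4, pipelines=4):
    """Add two vectors, handling `pipelines` consecutive elements per step."""
    if len(vr0) != len(vr4):
        raise ValueError("vector lengths differ")
    vr8 = []
    steps = 0
    for start in range(0, len(vr0), pipelines):
        stop = min(start + pipelines, len(vr0))
        # One group of up to `pipelines` elements is read, added and written.
        vr8.extend(vr0[i] + vr4[i] for i in range(start, stop))
        steps += 1
    return vr8, steps

b = list(range(20))
d = [10] * 20
result, steps_four = elementwise_add(b, d, pipelines=4)
_, steps_one = elementwise_add(b, d, pipelines=1)
print(steps_four, steps_one)
```

For a vector length of 20, the four-pipeline case needs 5 groups where a single pipeline needs 20, which is the one-fourth ratio discussed for Fig.7.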


[Fig.6: VR8 <- VR0 + VR4. The vector register
control part reads four elements from the vector
registers, the four adder pipelines execute, and
the four results are written to the vector
register.]


Fig.6 Elementwise Parallel Processing


Fig.7(a) and Fig.7(b) show processing by means of
the single pipeline scheme and the elementwise
parallel pipeline (processing) scheme respec-
tively, taking an example of vector calculation

Ai = Bi x Ci + Di.     (2)

The vector length is assumed to be 20. The num-
ber in Fig.7 means the element number.

In Fig.7, load unit, load/store unit, adder unit 
and multiplier unit correspond to memory requester 
0, memory requester 1, adder and multiplier re- 
spectively in Fig.4. 

Loading of vector B, loading of vector C and
loading of vector D are executed by the load unit.

Multiplication of vector B and vector C is exe-
cuted by the multiplier unit.

Addition of the result of the multiplication,
that is B x C, and vector D is executed by the ad-
der unit.

Storing the final result, that is B x C + D, in
vector A is executed by the load/store unit.


Fig.7 Performance Enhancement by
Elementwise Parallel Processing

Since each unit in elementwise parallel process- 
ing consists of four fully segmented pipelines, 
the time required in each operation such as load- 
ing and adding is one fourth of that in single 
pipeline scheme. 

Thus the overall performance is improved. 


6.3 Instruction Stacking Scheme 


The main point of this paper is the "Instruction
Stacking Scheme" described below.

The aim of the "Instruction Stacking Scheme" is to
reduce the loss time caused by instruction switch-
ing. The "Instruction Stacking Scheme" is employed
in the instruction execution control part.

Fig.8 shows the structure of the instruction exe- 
cution control part. 

The instruction execution control part can have 
up to two vector instructions including the in- 
struction in execution, and starts the execution 
of vector instructions in order of their arrival. 

The current instruction register holds the vector 
instruction being executed. The next instruction 
register holds the next vector instruction. 

These two instruction registers play the role of
stacks, that is, two instructions can be stacked
for each resource. Therefore these two instruc-
tion registers are also called instruction stacks.


Fig.8 Structure of the Instruction Execution
Control Part

As an example, the instruction execution control 
part for the adder is explained below. 

Data path S8 means the information necessary for
execution of the instruction, that is, op-code,
vector register number, mask information, chain-
ing information and other control information, and
the signal S7 is the instruction start signal
which means that the vector instruction is issued
as explained in section 5.2. The signal S15 means
that the execution of the instruction by the re-
source and the vector register control part ended.

In Fig.8 data path S8, signal S7 and signal S15
are called "EX", "I" and "E" respectively, and
data path S8 is also called "instruction".

The counter consists of 2 bits and represents
the number of instructions included in the in-
struction execution control part.

The control circuit updates the counter, issues
the signal S21 which sets the next instruction
register, issues the signal S23 which sets the
current instruction register and issues the sig-
nal S22 which switches the selector.

Fig.9 is the status transition graph of the 
counter. Each node represents one value of the 
counter. The value 00/01/10 means that there are 
0/1/2 instructions in the instruction execution 
control part. The value 11 signifies that a mal- 
function has occurred and a machine check signal 
is sent to the SP in this case. 

In Fig.9, "I" means that signal "I" is issued
and "I" with an overbar means that signal "I" is
not issued.

This convention applies also to "E" and "E" with
an overbar.

The status transition of one example case is ex- 
plained as follows. 

Suppose that the signal "I" is not issued and
the signal "E" is issued when the contents of the
counter are "10".

In this case, since the execution of the in-
struction in the current instruction register
ended and a new instruction is not issued, the
contents of the current instruction register are
replaced by those of the next instruction regis-
ter and the next instruction register becomes
empty. As a result the value of the counter be-
comes "01", since there is one instruction in the
instruction execution control part.
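
The counter behavior can be sketched as a small state machine. Since Fig.9 is not reproduced here, the general transition rule below (each "I" adds one instruction, each "E" removes one, and any out-of-range result becomes the malfunction value 11) is an assumption inferred from the worked example; only the 10 -> 01 transition is stated in the text. The function name next_count is hypothetical.

```python
MALFUNCTION = 0b11  # value that triggers a machine check signal to the SP

def next_count(count, i_issued, e_issued):
    """Return the next 2-bit counter value (assumed transition rule)."""
    if count == MALFUNCTION:
        return MALFUNCTION
    new = count + (1 if i_issued else 0) - (1 if e_issued else 0)
    # At most two instructions (current and next) can be held.
    if new < 0 or new > 2:
        return MALFUNCTION
    return new

# The example from the text: counter "10", "I" not issued, "E" issued.
example = next_count(0b10, i_issued=False, e_issued=True)
print(format(example, "02b"))
```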

The control circuit in Fig.8 operates as follows.

Table 2 shows the values of signals S21, S22 and
S23. In Fig.8 signals S21 and S23 set the next
instruction register and the current instruction
register respectively, and signal S22 selects the
contents of the next instruction register to data
path S24 when its value is 1 and selects data
path S8 when its value is 0.

In this example case, the value of the counter,
that of signal "I" and that of signal "E" are
"10", "0" and "1" respectively. According to
Table 2 the value of signal S22 becomes 1 and
that of signal S23 becomes 1 so that the contents
of the current instruction register may be re-
placed by the contents of the next instruction
register in this case.

The other cases can be explained in the same way.

The resource conflict check part has the status
of these two instruction stacks, that is, the
current instruction register and the next instruc-
tion register, so that superfluous instruction
start signals (S7) may not be issued.

By controlling the instruction execution as de-
scribed above, the outline of system control
becomes as follows.

When there is no instruction in execution, the
instruction sent from the instruction issue part
is held in the current instruction register.

When there is an instruction in execution, the
instruction sent from the instruction issue part
is held in the next instruction register and the
contents of the current instruction register are
replaced by those of the next instruction regis-
ter upon completion of the instruction.


Fig.9 Status Transition Graph of the Counter


Table 2. The Value of Control Signals in
the Instruction Execution Control Part


6.4 Effect of the Instruction Stacking Scheme


By employing the instruction stacking scheme de-
scribed in 6.3, immediately after the execution
of the instruction in each resource ends, the
next instruction can be given to the resource.

As a result, the resource can be used more effi-
ciently since vector data to be processed by the
current instruction and the next instruction can
be given to the resources without a break.

In order to show the effect of the instruction
stacking scheme, the timing charts of the case
where successive instructions use the same re-
source are shown in Fig.10.

The vector length of every instruction is assumed
to be 3. In Fig.10, the number represents the in-
struction.

Fig.10(a) shows the case that there is no in-
struction stack. Fig.10(b) shows the case that
there are two instruction stacks, that is, the
current instruction register and the next in-
struction register in Fig.8.

In Fig.10, the instruction start signal corre-
sponds to S7 and the execution ending signal cor-
responds to S13, S14, S15 or S16 in Fig.4.

In Fig.10(a), the execution on elements has one
vacant cycle at the point of instruction switch-
ing.

In Fig.10(b), the execution on elements is con-
tinuous and resources do not have idle time.


Next, the time T from the instruction start sig-
nal for the first instruction to the execution
ending signal for the N-th instruction is com-
pared between Fig.10(a) and Fig.10(b) as follows.
L means the vector length.

Fig.10(a) : T = N x L + N - 1     (3)
Fig.10(b) : T = N x L             (4)

Therefore, the effect P of the instruction stack-
ing scheme relative to the no instruction stack
case is

P = (3) / (4) = 1 + 1/L - 1/(N x L)     (5)

When the number of instructions is large, (5)
becomes

P = 1 + 1/L     (6)


Table 3 shows the effect on the S-820 model 80
(4 elements are processed concurrently by
elementwise parallel processing).

Table 3 shows that the instruction stacking
scheme greatly improves the performance in case
of short vectors.

In order to enhance the performance, supercomput-
ers generally employ the elementwise parallel
pipeline processing scheme, and furthermore the
degree of elementwise parallel pipelining becomes
larger, that is, the number of elements which can
be processed concurrently becomes larger.

As a result the number of elements to be process-
ed per pipeline becomes smaller.

In this situation, this instruction stacking
scheme, which has a great effect on processing of
short vectors, becomes important.


Table 3 Effect of the Instruction Stacking Scheme
(columns: vector length, number of elements per
pipeline, effect (P-1))
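
Equations (3) to (6) can be checked numerically. The sketch below uses exact fractions; the function names are assumptions for illustration, and the sample values of N and L are arbitrary rather than taken from Table 3.

```python
from fractions import Fraction

def time_without_stack(n, l):
    """Eq. (3): N instructions of length L with one vacant cycle per switch."""
    return n * l + n - 1

def time_with_stack(n, l):
    """Eq. (4): N instructions of length L executed without a break."""
    return n * l

def stacking_effect(n, l):
    """Eq. (5): ratio P of the two times, as an exact fraction."""
    return Fraction(time_without_stack(n, l), time_with_stack(n, l))

n, l = 4, 3
p = stacking_effect(n, l)
closed_form = 1 + Fraction(1, l) - Fraction(1, n * l)
limit = 1 + Fraction(1, l)  # Eq. (6), large N
print(p, closed_form, limit)
```

The shorter the vector length L per pipeline, the larger the gain 1/L, which is why the scheme matters most for short vectors.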


7. Performance Evaluation 


Table 4 shows the algebraic average of the mea- 
surement results of Lawrence Livermore Laborato- 
ry's 14 Kernels under the following 4 conditions. 

(1) : normal condition
(2) : without vector processing linking
(3) : without vector instruction stacking
(4) : without vector processing linking
      and vector instruction stacking


Table 4 Algebraic Average of the Measured 
Results of Lawrence Livermore 
Laboratory's 14 Kernels 


condition 




The following considerations are derived from
Table 4.

The effect of vector processing linking is

(1)/(2) - 1 = 3.6 %     (7)

The effect of vector instruction stacking is

(1)/(3) - 1 = 5.1 %     (8)

The effect of combination of vector processing
linking and vector instruction stacking is

(1)/(4) - 1 = 10.1 %     (9)


The effect of elementwise parallel processing
has not been measured. Therefore it is roughly
estimated as follows.

The following two assumptions are made.

First, performance improvement of the S-820 on
the S-810 is assumed to be attributed to reduc-
tion of machine cycle, elementwise parallel proc-
essing, vector processing linking, vector proc-
essing signaling and vector instruction stacking.

Second, the effect of vector processing signal-
ing is assumed to be negligible since the cases
where vector processing signaling applies are
somewhat limited.

Since the machine cycle of the S-820 is 57.12% of
that of the S-810 and the algebraic average of
14 Kernels on the S-810 is 137.9 MFLOPS, the ef-
fect of elementwise parallel processing is

(380.9 / 137.9) x 0.571 - 1 = 58 %     (10)

where 380.9 is the value for condition (1) in
Table 4.
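
The arithmetic behind the estimate of Eq. (10) can be reproduced directly from the figures quoted in the text. The variable names below are assumptions for illustration.

```python
# Figures quoted in the text.
s820_average = 380.9   # condition (1) in Table 4
s810_average = 137.9   # S-810 average over the 14 Kernels, MFLOPS
cycle_ratio = 0.5712   # S-820 machine cycle as a fraction of the S-810 cycle

# Scale the measured speedup by the cycle ratio to remove the clock gain.
effect = s820_average / s810_average * cycle_ratio - 1
print(f"{effect:.1%}")
```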


\ 


8. Conclusions 


This paper presented high-speed vector instruc- 
tion execution schemes of the Hitachi supercom- 
puter S-820. 

Parallelism between scalar and vector processing,
elementwise parallel processing and instruction
stacking have been introduced as these schemes.

In particular instruction stacking greatly im- 
proves the performance of processing of short 
vectors and increases the efficiency of element- 
wise parallel processing. 

By the performance measurement on Lawrence
Livermore Laboratory's 14 Kernels, vector proc-
essing linking (which enhances parallelism be-
tween scalar and vector processing), vector in-
struction stacking and elementwise parallel
processing improve the overall performance by
3.6%, 5.1% and 58% respectively (where the ef-
fect of elementwise parallel processing is an
estimated value).


Abstract -- Empirically-derived models are constructed of the shared-
memory accessing delays associated with the dynamic-memory CRAY-2.


Introduction


Although memory accessing studies of conflict delays have been carried out
for the CRAY X-MP [1][2][3][4], the introduction of the slow massive
dynamic memory of the CRAY-2 (abbr. C-2) - producing typical algorithm
delays in the range of 20-40% - has emphasized the need for further analysis
of this phenomenon.


The origin of this conflict problem is principally not collisions along
multiple memory paths as studied in the 1970s [5], but rather the
bank reservation time (T_br), i.e., the time for a chip to recover from an
access. Although its effect can be lessened by memory reorganization, the
qualitative effects of a large T_br can be compensated principally by
increasing the number of banks, which is limited by space and hardware
considerations with fast clock periods.


Once a memory technology is proposed, then the principal question is the 
number of processors a given number of banks can support. 
Unfortunately, little insight is offered by current literature on the critical 
load parameters to be accommodated. There is, for example, no descriptor 
such as "hit-ratio" common in cache memory design. 


The C-2 common memory organization* is not as amenable to
mathematical analysis (or even simulation) as that of the CRAY X-MP, in
part because the C-2 contains a variety of buffering and bank-enhancement
techniques ("pseudo-banking") to compensate for the small number of
banks relative to T_br. Consequently, this paper proposes load parameters
derived from measurements made on a dedicated C-2 - a "black-box"
approach - and develops related mathematical and graphical models.


Experimental Model 


The experiments reported here are based on use of a dedicated
C-2. The four processors are conceptually divided into one processor
executing a test code instrumented to measure delays and 0-3 active
processors executing a load code (Figure 1); load processors are selectively
rendered inactive by running a no-access code. The following experiments
involve selection of test and load codes which are intended to give insight
into factors which influence the effects of load codes on a test code.


[Figure 1: three processors running LOAD CODE and one processor running
TEST CODE, all connected to the SHARED MEMORY.]

Figure 1. Experimental model


* See reference [6] for a partial description.


Model Parameter Selection: 
Th Y r ing Anomal 


It has been noted in measurement of conflict-induced performance 
degradation that vectorized codes incurred inexplicably large delays relative 
to scalar codes with a similar total number of memory accesses. Table 1 
gives illustrative percent delays incurred in a test code when identical load 
codes are run on three processors. Both a scalar and vector load code were 
run against four representative test codes. The table graphically shows 
that, although a scalar test code is marginally impacted by another scalar 
load, the same scalar load produces large degradations in a vector test code 
similar to that produced by a high-access-rate matrix multiply vector load. 
That is, a low-access scalar code presents a large apparent load to vector 
codes running in other processors. 


Table 1. Algorithm delay (%) of test codes. All codes are Fortran.
Run on NAS C-2 on 5/10/86.


                              Load Codes
Test Codes          Scalar (SORT)    Vector (MXM, matrix multiply)

Scalar
  GATHER                 2.6%              18.3%
  SORT                   3.2%              26.3%

Vector
  FLUIDS KERN.          32.0%              34.7%
  UNROLLED M*V          73.1%              85.6%


Measurement Probes 


To develop a measurement standard not dependent on accessing peculiarities 
of an application code, the accessing delay will be measured directly in two 
test codes which perform scalar and vector accesses only. These will be 
termed "probes", because their purpose will be to illuminate the 
discrepancies noted from Table 1. 


These probes will be calibrated by measuring access delays on a dedicated 
C-2 under a variety of loads with known key parameter values. As a result, 
an equation for each probe will be developed in terms of these parameters. 
Measurements with these probes can then be inverted to find the parameters 
of an unknown load. 


Mathematical Model of Probes


With Table 1 as a guide, the parameters chosen to represent the load are

    1. the average startup rate (R_s) of accesses from all loads
(startups/cp), and

    2. the average rate (R_a) of accesses from all loads (accesses/cp).

For a vector access with length VL, R_a = R_s*VL. The dependence on R_s
represents the influence of accessing irregularity introduced by scalars and
short vectors; the effect of memory traffic of any origin is given by
dependence on R_a.


The access delay (D) measured in the probe is defined as the extra time in
cps either (1) to secure a scalar element in a register for the scalar probe or
(2) to clear the memory path after a vector access in the vector probe. It is
to be represented in an N-variable Taylor's series truncated after the Mth
order, viz, with variables R_a and R_s and M=3


D = a_1 + a_2 R_a + a_3 R_s + a_4 R_a R_s + a_5 R_a^2 + a_6 R_s^2 + a_7 R_a^2 R_s


      + a_8 R_a R_s^2 + a_9 R_a^3 + a_{10} R_s^3                        (1)
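

A truncated two-variable polynomial of this kind can be evaluated generically. The sketch below is illustrative only: the helper names and the ordering of the coefficients are assumptions and do not follow the a_1 ... a_10 numbering of Eq. (1). It also shows why orders 1, 2 and 3 have 3, 6 and 10 terms.

```python
def monomials(order):
    """Exponent pairs (i, j) of R_a^i * R_s^j with i + j <= order."""
    return [(i, total - i) for total in range(order + 1) for i in range(total + 1)]

def evaluate(coeffs, r_a, r_s, order):
    """Evaluate the truncated series for one (R_a, R_s) pair."""
    terms = monomials(order)
    if len(coeffs) != len(terms):
        raise ValueError(f"expected {len(terms)} coefficients")
    return sum(c * r_a**i * r_s**j for c, (i, j) in zip(coeffs, terms))

term_counts = [len(monomials(m)) for m in (1, 2, 3)]
print(term_counts)

# Order-1 example with made-up coefficients: constant, R_s term, R_a term.
value = evaluate([1.0, 2.0, 3.0], r_a=0.5, r_s=0.25, order=1)
print(value)
```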


The delays for the scalar and vector test probes will be indicated D_ps and
D_pv, respectively.


Saturation 


The limiting rate at which accesses can be made occurs when all memory
banks are continuously reserved due to T_br. If the load accessing rate is
R_a, the test code accessing rate is R_tc, and the number of banks is N_b,
then define the memory saturation fraction as


F_sat = (R_a + R_tc) T_br / N_b        (2)

F_sat < 1                              (3)


The proximity of F_sat to unity will be a measure of the load intensity.
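

As a minimal sketch, the saturation fraction can be computed as below. The function name and the numerical values are assumptions chosen for illustration; they are not measured C-2 parameters.

```python
def saturation_fraction(r_a, r_tc, t_br, n_b):
    """Memory saturation fraction: total access rate times T_br over bank count."""
    return (r_a + r_tc) * t_br / n_b

# Hypothetical inputs: rates in accesses/cp, reservation time in cp.
f_sat = saturation_fraction(r_a=0.5, r_tc=0.5, t_br=64, n_b=128)
print(f_sat, f_sat < 1)
```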


Scalar Probe 


Calibration

In the scalar probe, a read is performed whenever the previous scalar read is 
secured in a register - a high but representative rate for scalar codes. 
Successive reads are one address apart to avoid self-conflicts. Average delays 
in five groups of 4000 reads are recorded (20000 total); these group 
averages are scanned for consistency and a resulting overall average 
determined. 


To calibrate the probe, the load codes are chosen to contain only unit-stride
vector accesses, but these are performed at selectable no-load rates. To
compensate for the regularity of the scalar test code, the load codes are
chosen to have irregular accessing, viz,

    (1) vector lengths are chosen uniformly random over 13 selected
ranges given by VLmin = {1,2,4,8,16,32,64,1,8,16,24,1,4} and VLmax =
{1,2,4,8,16,32,64,63,56,48,40,15,12}, yielding the average load vector
length vl = {1,2,4,8,16,32,64,32,32,32,32,8,8}; the latter six vls test the
use of the mean vector length as a load descriptor irrespective of the
standard deviation of vector lengths;

    (2) the addresses of the first access of each vector are uniformly
randomly distributed across 256 banks, to accommodate the branch-
doubling effects of pseudo-banking in extending 128 physical banks;

    (3) the average no-load startup rates of load vectors are identical for
all active load processors during a single experiment, but are varied at six
selectable rates between experiments.
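
The relation between the 13 ranges and the average vector lengths can be checked with a short sketch. The list names and the draw_length helper are assumptions for illustration; the range endpoints are the ones listed in item (1).

```python
import random

VL_MIN = [1, 2, 4, 8, 16, 32, 64, 1, 8, 16, 24, 1, 4]
VL_MAX = [1, 2, 4, 8, 16, 32, 64, 63, 56, 48, 40, 15, 12]

# Mean of a uniform integer range [lo, hi] is (lo + hi) / 2.
mean_vl = [(lo + hi) // 2 for lo, hi in zip(VL_MIN, VL_MAX)]
print(mean_vl)

def draw_length(index, rng):
    """Draw one load vector length uniformly from range number `index`."""
    return rng.randint(VL_MIN[index], VL_MAX[index])

rng = random.Random(0)
sample = [draw_length(7, rng) for _ in range(5)]  # range 1..63, mean 32
```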

The thirteen vls and six startup rates yield 78 experiments. In addition, it
was noted that the accessing rates of the load codes are slowed by conflicts;
the load codes are consequently themselves instrumented to measure their
actual average accessing rates (another 78 experiments). These measured
loaded average startup rates will be denoted r_s; the averaged measured
values for R_a are denoted r_a and determined from r_a = r_s*vl. All
experiments are performed automatically by multitasking. Approximately three
minutes of elapsed time are required to carry out the 156 experiments on the
C-2.


An elementary requirement of the calibration process is that the loads be
chosen so that the delays are in the range of those of daytime loads. In
these 78 tests, the average measured probe delay was 28cp, whereas a
typical daytime delay on the NAS 80-nsec system is 35cp.


Table 2. Results of least squares fitting of probe data.
80-nsec dynamic C-2 memory (NAS)


(a) Scalar probe
                 # of     Correl.        RMS error/
Variables        terms    Coefficient    ave. value
Order=1
  r_s              2        .053           66.5%
  r_a              2        .9987          3.45%
  r_s, r_a         3        .9987          3.43%
Order=2
  r_s              3        .831           34.8%
  r_a              3        .9992          2.73%
  r_s, r_a         6        .9992          2.63%

(b) Vector probe
                 # of     Correl.        RMS error/
Variables        terms    Coefficient    ave. value
Order=1
  r_s              2        .799           36.7%
  r_a              2        .564           50.4%
  r_s, r_a         3        .967           15.6%
Order=2
  r_s              3        .831           34.0%
  r_a              3        .567           50.4%
  r_s, r_a         6        .992           7.60%
Order=3
  r_s              4        .833           33.7%
  r_a              4        .567           50.4%
  r_s, r_a        10        .993           7.19%


The measured data was fitted in the least squares sense with the expression
of Eq. (1), using only r_s, only r_a, and then both r_s and r_a as variables.
The order of fit was also appropriately varied (see discussion of vector probe
fitting). The results of curve fitting are shown in Table 2(a), with the linear
fit in r_a only chosen (bold type). The resulting linear model has the form

    D_ps = 1.614 + 6.59 r_a                                              (4)

This expression invites the following observations. 

(1) Model parameters. The fact that a good fit - as indicated by the
correlation coefficient and the RMS error - is achieved with a function of
only r_a in spite of the above variety of loading conditions supports the
intuitive notion that access delays in scalar reads are explicit functions
only of total memory traffic and not of load vector lengths or the frequency
of vector startups. That is, 64 scalar accesses produce the same loading to
scalar reads as a single 64-length vector access; otherwise, D_ps would be a
function of r_s.

(2) Constant term. The constant term of Eq. (4) represents the effects of
memory refresh on scalar reads.

(3) Linear term coefficient. The coefficient of r_a is a succinct direct
measure of the effective access delay caused by loading. Unfortunately, 
measurements on other C-2 memories [7] show that these coefficients are 
not obviously relatable to Tp,, as might be expected. 


Use of the Scalar Probe 


Since the scalar probe measures only r_a, it becomes a convenient
measurement of total memory accessing. Toward this goal, measurements
were made of the NAS 80-nsec C-2 during a daytime load by averaging 5
million scalar probe accesses. The following estimate of total memory
accessing was obtained, using Eq. (4).


    D_ps = 5.2 cp  --->  r_a = .544 accesses/cp


Below, r_a will be used to determine a complete memory load
characterization.
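
The estimate above amounts to inverting the linear scalar model. The sketch below assumes Eq. (4) reads D_ps = 1.614 + 6.59 r_a, a reading of the printed equation that reproduces the quoted pair (5.2 cp, .544 accesses/cp); the constant and function names are illustrative, not from the paper.

```python
# Coefficients of Eq. (4) as read from the text (an assumption about the print).
D0_SCALAR = 1.614   # constant term, in clock periods (cp)
K_SCALAR = 6.59     # cp of added delay per (access/cp) of memory traffic


def scalar_probe_delay(r_a):
    """Forward model: scalar probe delay for a total access rate r_a."""
    return D0_SCALAR + K_SCALAR * r_a


def access_rate_from_scalar_delay(d_ps):
    """Inverse model: total access rate implied by a measured scalar delay."""
    return (d_ps - D0_SCALAR) / K_SCALAR


print(round(access_rate_from_scalar_delay(5.2), 3))  # 0.544
```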


Vector Probe Calibration


The vector probe consists of reading 64-length unit-stride vectors initiated 
at successive addresses 64 banks apart so as to scan all 256 pseudo-memory 
banks in four reads. In the vector test code, each vector read is initiated 
whenever the memory path becomes available, a maximum rate 
representative of many vector codes for which the single memory path is a 
data flow bottleneck. The same selection of loads is used as in the scalar
probe calibration above (156 experiments total). The average probe delay
measured in these calibration runs is 4.7 cp, close to a typical measured 
daytime value of 5 cp. 


The results of fitting equations to the vector read probe data are shown in
Table 2(b). It was decided to choose the second order fit D_pv in both r_s and
r_a (bold type) because

(1) fits in only r_s or only r_a of any order yielded unacceptable
RMS error, and

(2) in choosing between first, second, and third order fits
involving both r_s and r_a, the second order fit produced by far the greatest
improvement from lower orders.


It should be noted that, although higher-order approximations are 
necessarily better fits for the measured data, they often suffer from higher 
volatility over the entire range of fitting and offer less insight in 
extrapolations beyond the range of measurement. 


A graphical model 
The least squares fit corresponding to the choice in Table 2(b) is

    D_pv = 3.34 - 6.23 r_a + 418 r_s + 8.22 r_a^2 + 467 r_a r_s - 691 r_s^2   (6)

The constant term (3.34 cp) represents the effects of memory refresh; this
agrees with no-load measurement.


Contour representations of D_pv are shown in Figure 2. Since r_a = vl*r_s,
constant-vl lines can be drawn radially through the origin; they are shown
for vl=1, 8, and 64. This presentation invites the following comments.

(a) It is clear that the contour curvature increases as vl decreases,
correlating with previous observations concerning the degrading effects of
scalars.

(b) In the region denoted "heavy vector load", r_a cannot be
increased by increasing the no-load accessing rate, indicating a saturated
condition. Eq. (2) yields F_sat = .75, using Np = 128.

(c) With preselected no-load startup rates, a maximum delay of 69cp
is produced when vl=8, represented by the radial line shown. This
combines the worst conditions of high vector startup rate (r_s) and high total
memory activity (r_a), and shows that short vectors rather than scalars can
be the most destructive of performance.


It is felt that Figure 2 represents the principal feature of this load
characterization, namely, the ability to succinctly depict an accurate
nonlinear delay model dependent on only two parameters (r_a and r_s), with a
third parameter (vl) easily represented and defining the limits of model
validity.

Figure 2. Contour representation of D_pv (no measurements in shaded areas).
Labeled regions: moderate scalar load; maximum delay (F_sat = .43); heavy
vector load (F_sat = .75).


Gather/scatter


The increasing nonlinearity of the contours of Figure 2 in the scalar region
invites more tests to investigate the limitations of the model in this
region. Accordingly, the three load processors executed gather/scatter
vector operations with VL=1,2,4,8, and 16, to produce a series of scalar
experiments with .11 < r_s = r_a < .32 and 45 < D_pv < 110 cp. (Note that
the gather/scatter proceeds at a maximum rate of .25 accesses/cp/processor
on the C-2.) To evaluate the error of the nonlinear model of Eq. (6), the
latter was specialized to the scalar case by setting r_s = r_a, yielding

    D_pv ≈ 3.34 + 412 r_a - 215 r_a^2                                    (7)

For these 5 points, an error ratio (RMS error/average value) = 6.4% was
achieved, in comparison with the ratio of 7.6% of Table 2(b). Thus, the
model of Eq. (6) appears quite good extended along the vl=1 line of Figure
2. It must be pointed out that for VL=32 and 64, the group averages read
from the probe became erratic, possibly indicating a surging under intense
memory loading.
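
The scalar specialization can be reproduced by collecting terms of the second-order model along the line r_s = r_a. The sketch below assumes the six coefficients of Eq. (6) read 3.34, -6.23, 418, 8.22, 467 and -691 (a reading of the printed equation, rounded to three figures); the combined coefficients it prints agree with those of Eq. (7) to within that rounding.

```python
# Coefficients of the second-order vector-probe model, Eq. (6), as read from
# the text (an assumption about the print): constant, r_a, r_s, r_a^2,
# r_a*r_s, r_s^2.
C0, CA, CS, CAA, CAS, CSS = 3.34, -6.23, 418.0, 8.22, 467.0, -691.0


def vector_probe_delay(r_a, r_s):
    """Vector probe delay (cp) for access rate r_a and startup rate r_s."""
    return (C0 + CA * r_a + CS * r_s
            + CAA * r_a ** 2 + CAS * r_a * r_s + CSS * r_s ** 2)


def scalar_specialization():
    """Constant, linear and quadratic coefficients along r_s = r_a (vl = 1)."""
    return C0, CA + CS, CAA + CAS + CSS


const, lin, quad = scalar_specialization()
print(round(lin, 2), round(quad, 2))   # 411.77 -215.78
```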


Non-unit-stride accessing 


All of the load accessing reported above has involved unit-stride
vectors. It is felt, however, that typically 10-20% of site accessing is
non-unit-stride, principally due to complex arithmetic. To indicate the
effects of such accessing, a series of tests was made with 3-processor loads
of strides ranging from 1 to 127. Figure 3 shows selected resulting average
delays measured by the vector test probe; they are seen to range to 277cp,
an astounding delay for a 64-length access. The explanation is likely that
successive accesses have the same effect as random scalar accesses, such as
occur in the above gather/scatter; however, this vector access proceeds at
full rate, unlike gather/scatters on the C-2. The largest average delay
observed has been 284 cp for a stride of 75. This indicates the possibility
of stride being a dominant load parameter under certain conditions, thus
establishing a limit on a model involving only r_s and r_a. As the latter
model is refined, it may be possible to add the average load stride as a third
parameter. The resulting 3-D contours could offer insight similar to Figure
2 concerning design criticality to a variety of loads.


Figure 3. Delays of vector probe with non-unit-stride loads

In summary of the above models, identification of the memory as 80-nsec
dynamic (the C-2 supports two other memory technologies) establishes a
set of eight coefficients of D_ps(r_a) and D_pv(r_a, r_s). Knowledge of the
load rate parameters r_a and r_s then permits an estimate of the probe delay,
i.e.,
    { r_a, r_s } ===> { D_ps, D_pv }
It is also clear that the simple functional forms of D_ps and D_pv permit
solution of the inverse problem, i.e.,
    { r_a, r_s } <=== { D_ps, D_pv }
This raises the prospect of being able to infer both load rate parameters of a
site from execution of the two probes. Specifically, Figure 4 shows that r_a
is derived from measurement of D_ps and use of Eq. (4); r_s is then found
from D_pv, r_a, and Eq. (6) by solving a quadratic equation.


Figure 4. Modeling from site measurement probes 
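
The two-step inversion can be sketched as follows. The coefficients are readings of the printed Eqs. (4) and (6) and should be treated as assumptions; the choice of the smaller quadratic root as the physically meaningful one is also an assumption of this sketch, as are all names.

```python
import math

# Readings of Eq. (4) and Eq. (6) from the text (assumptions about the print).
D0_SCALAR, K_SCALAR = 1.614, 6.59
C0, CA, CS, CAA, CAS, CSS = 3.34, -6.23, 418.0, 8.22, 467.0, -691.0


def scalar_probe_delay(r_a):
    return D0_SCALAR + K_SCALAR * r_a


def vector_probe_delay(r_a, r_s):
    return (C0 + CA * r_a + CS * r_s
            + CAA * r_a ** 2 + CAS * r_a * r_s + CSS * r_s ** 2)


def infer_load(d_ps, d_pv):
    """Return (r_a, r_s) inferred from the two measured probe delays."""
    # Step 1: the scalar probe gives the total access rate directly.
    r_a = (d_ps - D0_SCALAR) / K_SCALAR
    # Step 2: with r_a known, Eq. (6) is a quadratic a*r_s^2 + b*r_s + c = 0.
    a = CSS
    b = CS + CAS * r_a
    c = C0 + CA * r_a + CAA * r_a ** 2 - d_pv
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("measured delays lie outside the range of the model")
    # a is negative, so this is the smaller of the two roots.
    r_s = (-b + math.sqrt(disc)) / (2.0 * a)
    return r_a, r_s


# Round trip on an arbitrary load point (not a measurement from the paper).
r_a, r_s = infer_load(scalar_probe_delay(0.5), vector_probe_delay(0.5, 0.09))
print(round(r_a, 6), round(r_s, 6), round(r_a / r_s, 3))
```

The ratio r_a / r_s printed last is the inferred average load vector length vl.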


This process can be viewed as a potential software performance monitoring
tool, in lieu of monitoring hardware absent in the C-2. For example,
these probes were run with a daytime load on the 80-nsec NAS C-2
memory, obtaining r_a = .491 from Eq. (4), r_s = .083 from Eq. (6), and vl =
r_a/r_s = 5.56. The relatively low average load vector length likely occurs
because (1) both (scalar) addressing data accesses and floating-point data
accesses are counted in this measure, (2) the above model does not include
non-unit-stride accesses, which would have a tendency to appear partly as
scalars, and (3) although the average model error has been determined in
Table 2, the error in computing the model inverse has not. Thus, this
calculation should be regarded as illustrative at the time of this writing.

Conclusion

The above offers a mixture of formal and anecdotal modeling observations, 
representing the state of information gleaned from an initial series of 
dedicated tests. As this modeling process continues and all the major load 
parameters are identified, it may be possible to relate them both to (1) 
design parameters using simulation, and (2) application code performance.
The latter, however, will require development of a feedback model allowing
the determination of loaded response from an unloaded accessing parameter
characterization; this in turn necessitates a method for characterizing
application code sensitivity to access delays. In summary, this effort is the
first step - parameter identification - in a much longer research study.

Abstract


The Symmetry Series [Gif87] is a bus-based, shared-memory
multiprocessor system which can contain from two to thirty
32-bit microprocessors with a total performance of around
100 MIPS. Each processor subsystem contains an Intel
80386/80387 microprocessor/floating point unit, an optional
Weitek 1167 floating point accelerator, and a private cache.
The system features a 53 Mbyte/sec pipelined system bus, up
to 240 Mbytes of main memory, and a diagnostic and console
processor. The cache hardware supports two different cache
coherence policies: write-through and copyback. Symmetry
represents one of the first shared-memory bus-based
multiprocessor systems to use both write-through and
copyback protocols with a split-transaction system bus. The
performance of the two cache coherence policies has been
measured and is compared here for various benchmarks and
applications.


1. Introduction 


The performance of bus-based shared-memory mul- 
tiprocessors is limited by the bandwidth supplied by the 
bus and memory subsystems, and by the demands made on 
them by each processor subsystem. Typically, such mul- 
tiprocessor systems use local caches to reduce a 
processor’s demand on the bus. The use of multiple 
caches on a common bus causes the cache coherence 
problem. Many different solutions have been proposed to 
solve this problem [ArB86] with a wide range of cost and
performance tradeoffs. The Balance multiprocessor sys- 
tem [TGF88] used write-through caches per processor 
[MaE84] to reduce the demand on the bus. Our studies 
[Tha87] showed that many writes generated by each proces- 
sor in a bus-based shared-memory multiprocessor can be a 
limitation in performance when increasing the number and 
speed of the individual processors. Thus this cache policy 
was a problem in the design of Symmetry whose goal was 
to have 4-5 times the performance of Balance. Another 
design requirement of the Symmetry system was that it be 
an extension of the Balance system and that it remain com- 
patible with the Balance I/O controllers. In addition, the 
Symmetry boards had to function when installed in Bal- 
ance systems with Balance memories. This requirement 
led to the implementation of two different cache policies, 
one write-through and the other copyback, the choice of 
which depended on the hardware environment. Copyback
caches have been shown to increase performance of a
system by reducing the writes to the memory in a
uniprocessor environment [AKC86]. Embedded hardware now
offered us the ability to measure the performance of the
system with the two different caching policies.

First, we review some of the protocols that have been
described to solve the cache coherence problem. Next we
describe how the Balance architecture was extended for
the Symmetry Series. Finally, we describe performance of
the two cache coherence schemes using bus utilization as a
metric.


2. Multiprocessor Cache Protocols For Bus Based Systems


The use of multiple private caches on a bus causes 
the cache coherence problem; a write to main memory 
from a processor or input/output (IO) device must be 
reflected into the contents of all caches that reside on the 
bus. Many different cache protocols have been developed 
to reduce a processor’s bus requirements while solving the 
cache coherence problem. 


The simplest approach involves the use of a write- 
through cache. In a write-through cache each cache block 
is tagged as either VALID or INVALID. On a read hit 
the data is returned to the processor from the cache, 
without any bus traffic required. On a read miss the block 
containing the requested data is read from main memory 
and installed into the cache and marked VALID. The 
data is then passed to the processor. All writes cause the 
data being written to be passed over the system bus into 
main memory. If the block is present in the cache the 
cached copy is also modified. To maintain coherence in a 
bus-based multiple cache system all caches watch the bus 
for writes (hence the name snoopy caches). When a cache
detects a write on the bus to a block marked VALID, the 
block is simply invalidated. The next access to that block 
results in a miss and the modified data is retrieved from 
main memory. 
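
The write-through scheme just described can be illustrated with a minimal simulation. This is a sketch under simplifying assumptions (one word per block, a dictionary standing in for main memory, no replacement); the class and attribute names are illustrative and do not come from any Sequent design.

```python
class Bus:
    """Shared bus with main memory and the caches that snoop it."""

    def __init__(self):
        self.memory = {}    # block address -> data
        self.caches = []
        self.reads = 0      # bus read transactions
        self.writes = 0     # bus write transactions

    def write(self, source, block, value):
        self.writes += 1
        self.memory[block] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(block)


class WriteThroughCache:
    def __init__(self, bus):
        self.lines = {}     # block address -> data; present means VALID
        self.bus = bus
        bus.caches.append(self)

    def read(self, block):
        if block not in self.lines:           # read miss: fetch and mark VALID
            self.bus.reads += 1
            self.lines[block] = self.bus.memory.get(block, 0)
        return self.lines[block]              # read hit: no bus traffic

    def write(self, block, value):
        if block in self.lines:               # update the cached copy if present
            self.lines[block] = value
        self.bus.write(self, block, value)    # every write goes to main memory

    def snoop_write(self, block):
        self.lines.pop(block, None)           # VALID -> INVALID


bus = Bus()
a, b = WriteThroughCache(bus), WriteThroughCache(bus)
a.read(0x100)
b.read(0x100)
a.write(0x100, 7)        # invalidates the copy held by b
print(b.read(0x100))     # b misses and rereads the new value from memory
print(bus.reads, bus.writes)
```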


Since all writes are passed directly to main memory, a 
write-through cache does not reduce write traffic on the 
shared bus. This can be tolerated in systems where the 
bandwidth required by an individual processor is a small 
fraction of the bandwidth available on the bus as observed 
on Balance systems [Tha87]. As the individual processors 
get faster, the write-through protocol will consume too
much bus bandwidth to support a moderate number of
processors.


Copyback caches do not send all write traffic through 
to main memory. The cached copy is written locally, if 
present, and the modified data is not written back to main 
memory until the cache block is replaced. Copyback 
caches have the potential for removing much of the write 
traffic from the bus in addition to the read traffic, but at 
some expense in terms of complexity. Copyback caches
have been shown to give at least a 30% increase in
performance over write-through caches [AKC86]. This
suggests that multiple writes are done to a block before it
is replaced. Thus copyback caches represent an attractive
choice for bus-based multiprocessor systems because they
remove these writes from the bus.


In a copyback cache there are at least three states: 
INVALID, VALID, and MODIFIED, where MODI- 
FIED indicates that the block has been written locally, but 
main memory has not been updated. When a MODIFIED 
block is replaced it must be copied back to main memory.
Reads are handled as they are in write-through caches. 
Most copyback protocols use write allocation, so that write 
misses cause a block to be allocated in the cache and the 
data to be written into the newly-installed block. A write- 
miss is turned into a read-miss on the bus. Write hits 
cause the data to be written directly into the cache block, 
leaving the block MODIFIED, and generating no bus 
traffic. 


Several different protocols have been proposed for 
maintaining coherency in multiple cache copyback systems 
[ArB86]. Most of these protocols solve the coherence 
problem by allowing a MODIFIED block to exist in only 
one cache at a time. Before a write can occur in a cache 
the cache must have ownership of the block it wants to 
modify. This means that the cache must have the only 
valid cached copy of that data in the system. A write miss 
at the cache requires a read on the bus to acquire the data 
and ownership of the block. Only in the case where the 
block is already in the cache and known to exist in no 
other cache is no bus activity required for a write. The 
data is written back to the memory only when the modified 
data needs to be replaced to make room for new data. 
The Berkeley [KEW85] and Illinois [PaP84] protocols are
based on the ownership principle. They use a
write-invalidation scheme to invalidate SHARED copies in
other caches before a write is allowed to proceed in a
cache. Thus these schemes allow multiple readers but only
a single writer. Heavy active write sharing can degrade the
performance of these systems because the blocks of data
must be shuttled back and forth between caches sharing
this data. Results from our studies suggest that active
write sharing is almost insignificant in the parallel
applications we ran on the Balance system [Tha87].


The Dragon [McC85] and Firefly [ThS87] protocols
allow MODIFIED blocks to be held in multiple caches,
but require all writes to these shared blocks to be
broadcast across the bus (i.e., they use write broadcast as
opposed to write invalidate). A SHARED state is usually
added to distinguish between unmodified blocks that are
privately held and those that may exist in other caches.
This allows writes to non-SHARED blocks to proceed
without generating any bus traffic. These schemes work
better when there is a significant amount of active write
sharing in the applications since they prevent shuttling of
blocks between caches.


In all of these schemes caches that own MODIFIED
blocks must watch the bus for accesses to those blocks and
ensure that the response reflects the modifications. There
are two basic approaches for this. Either the cache
responds to the access itself, or it holds up the access,
writes the modified data to memory, and then allows the
memory to respond.

Caches must also watch the bus for accesses to
blocks marked PRIVATE and SHARED and take actions
appropriately. Bus transactions may cause the state of a
block to change from PRIVATE to SHARED, or cause
the state to change to INVALID.


3. Design Criteria for Symmetry 


In 1984 Sequent introduced the Balance Series of 
multiprocessors, based on a shared-memory architecture 
and a high speed bus interconnect. The bus contains a 32 
bit multiplexed address and data path and uses a split 
response protocol. The split response protocol releases 
the bus between a read request and its corresponding 
response. This allows the bus to be used for other transac- 
tions during an otherwise idle period. The bus protocol 
defines three pipes, a read pipe, a write pipe, and an IO 
pipe, which allows memory accesses to proceed indepen- 
dent of IO accesses. It also allows reads to proceed 
independent of write accesses, except that requests to an 
individual memory subsystem must be serviced by that sub- 
system in order of receipt. The bus protocol limits the 
outstanding requests to 3 read requests, 2 write requests 
and 1 IO request. This is done in a distributed manner, 
with each requester maintaining a current pipe count. If 
the count shows that the required pipe is full no new 
requests will be issued. Responses are always returned in 
the order of the requests, obviating the need for a reques- 
ter tag. Requesters make a note of the current pipe count 
when their request is placed on the bus and count off the 
responses as they are returned. The bus, memory and
processors all run synchronously at 10 MHz. The bus and
memory subsystems support a sustained bandwidth of 26.7
Mbytes/sec.
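
The distributed pipe bookkeeping described above can be sketched as a small counter kept by each requester. This is an illustrative model only: the pipe limits are those given in the text, while the class, its methods, and the way a requester records its own position are assumptions made for the example.

```python
# Outstanding-request limits per pipe, as stated in the text.
PIPE_LIMITS = {"read": 3, "write": 2, "io": 1}


class PipeCounter:
    """One requester's view of a single bus pipe."""

    def __init__(self, limit):
        self.limit = limit
        self.outstanding = 0   # requests seen on the bus and not yet answered
        self.mine = []         # for each of my requests: responses still ahead

    def can_issue(self):
        """A new request may be placed on the bus only if the pipe is not full."""
        return self.outstanding < self.limit

    def observe_request(self, is_mine=False):
        """Called for every request seen on the bus, including my own."""
        if is_mine:
            self.mine.append(self.outstanding)   # note the current pipe count
        self.outstanding += 1

    def observe_response(self):
        """Count off one response; return True if it answers my request."""
        self.outstanding -= 1
        hit = bool(self.mine) and self.mine[0] == 0
        if hit:
            self.mine.pop(0)
        self.mine = [ahead - 1 for ahead in self.mine]
        return hit


read_pipe = PipeCounter(PIPE_LIMITS["read"])
read_pipe.observe_request()               # another requester's read
read_pipe.observe_request(is_mine=True)   # my read, one response ahead of it
read_pipe.observe_request()               # a third read fills the pipe
print(read_pipe.can_issue())
print([read_pipe.observe_response() for _ in range(3)])
```

Because responses return in request order, the counter alone identifies which response belongs to the requester, which is why no tag is needed.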


Bus cycles are identified by a 5 bit Cycle Type field 
which is placed on the bus along with the address or data. 
The cycle type field identifies the current cycle as an 
address or data cycle, a read or write operation, and, for 
address cycles, the size of the transaction. Transactions 
from one to sixteen bytes are supported. 


Bus arbitration is handled by a central arbiter. Pro- 
cessor priority is assigned on a rotating round robin basis 
with the processor which last used the bus assigned the 
lowest priority. When a processor receives a bus grant but 
cannot use the bus because the desired pipe is full, the 
grant is held until the pipe becomes free. This provides 
each processor fair access to the bus. 
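
A rotating round-robin arbiter of the kind described can be sketched in a few lines. This shows one common realization, assumed here for illustration: the search for the next grant starts just after the processor that last used the bus, so that processor is considered last. The function name and arguments are hypothetical.

```python
def arbitrate(n_processors, last_granted, requesting):
    """Return the processor to grant next, or None if nobody is requesting.

    Priority rotates: the processor after `last_granted` is highest, and
    `last_granted` itself is lowest.
    """
    for offset in range(1, n_processors + 1):
        candidate = (last_granted + offset) % n_processors
        if candidate in requesting:
            return candidate
    return None


# Processors 0 and 1 both request continuously; grants alternate.
last = 1
for _ in range(4):
    last = arbitrate(4, last, {0, 1})
    print(last)
```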


The cache is an 8 Kbyte, 2 way set associative cache 
[MaE84]. It uses a write-through protocol with bus watch- 
ing to maintain coherency. The bus watching is imple- 
mented with a second set of tags so that bus watching look- 
ups can proceed in parallel with processor cache accesses. 
Since the processor runs synchronous to the bus, maintain- 
ing two sets of tags is simple. Bus watching invalidations 
occur only on writes. Since write addresses are always fol- 
lowed by at least one write data cycle, the processor tags 
can be updated during that cycle, stealing the cycle from
the processor.


This design was able to support thirty 0.7 MIP
processors in many useful applications, with a wide range
of benchmarks showing up to 28 effective processors. There
were, however, several applications where the write traffic
generated by 30 processors was enough to swamp the bus.
The size of the cache was also a limitation for certain
floating point applications [Tha87]. As we were
contemplating implementing the system with the latest
generation of microprocessors it was apparent that
supporting thirty 3-4 MIP processors would require a
different approach.


4. Symmetry Series 


In 1986 Sequent began the design of a Symmetry 
Series multiprocessor system based on the Intel 80386 
microprocessor. A major goal was to be able to support 
as many processors as the Balance Series while maintain- 
ing compatibility with the Balance peripheral controllers. 
Since the new processor had 4-5 times the performance of 
the Balance processor we needed a 4-5 fold increase in the 
ratio of available bandwidth to processor demand, without 
drastically altering the bus subsystem. The bus bandwidth 
was increased by doubling the width of the datapath to 64 
bits. This doubles the sustainable bandwidth to 53.4 
Mbytes/sec. 
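
The arithmetic behind this design goal can be made explicit. The sketch below assumes that bus demand scales linearly with processor speed when traffic per unit of work is unchanged (an interpretation, not a statement from the paper); under that assumption, doubling the bus bandwidth leaves a factor of 2 to 2.5 to be recovered by reducing each processor's bus traffic.

```python
BALANCE_BW = 26.7               # Mbytes/sec sustained, 32-bit data path
SYMMETRY_BW = 2 * BALANCE_BW    # 64-bit data path doubles the sustained rate


def required_traffic_reduction(speedup, bandwidth_gain):
    """Factor by which bus traffic per unit of work must fall so that the
    bandwidth-to-demand ratio keeps pace with a faster processor."""
    return speedup / bandwidth_gain


print(round(SYMMETRY_BW, 1))
for speedup in (4, 5):
    print(speedup, required_traffic_reduction(speedup, SYMMETRY_BW / BALANCE_BW))
```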


To reduce the demand placed on the bus by the pro- 
cessors we increased the cache size to 64 Kbytes per pro- 
cessor, and implemented a copyback protocol to remove 
write traffic from the bus. The protocol is similar to the 
Illinois cache coherency scheme [PaP84]. It has been 
shown that the performance of the Illinois scheme is as 
good as the best of the copyback schemes in systems with 
moderate sharing. Since our studies showed only minimal 
active write sharing a more complex scheme was not 
necessary. The Illinois scheme has been shown to have 
superior performance in handling PRIVATE blocks when 
compared to other coherence protocols [ArB86]. Our stu- 
dies suggested that this was a more important considera- 
tion. 


Two additional cycle type bits were added to the bus
to extend the bus protocol to support the copyback cache
coherency scheme. The first bit is used to identify
transactions using the extended 64 bit width of the bus. The
second bit allows an address to be tagged with whether or
not it should cause an invalidation. This can be used with
a read address if a cache needs to ensure that it holds the
only copy of a block (i.e., gain ownership).


5. Symmetry Cache and Bus Protocols 


The Symmetry cache and bus protocols are related to 
each other to support cache coherency in the system. The 
Symmetry cache protocol [Gif87] makes use of four cache 
states: INVALID, PRIVATE, SHARED, and MODI- 
FIED. 


These states are defined as follows: 


INVALID - Block is not currently valid in the cache. 


PRIVATE - Block has been read and does not exist 
in any other cache in the system. 


SHARED - Block has been read and may exist in 
another cache. 


MODIFIED - Block has been modified and does not
exist in any other cache in the system.


The System Bus in the Balance multiprocessor supported
the following cycles to support the write-through
protocol:


RA - Read Address cycle
WAi - Write Address with Invalidate cycle
RDF/RDL - Read Data First and Last cycles
WDF/WDL - Write Data First and Last cycles


The System Bus protocol was extended in Symmetry to 
support the copyback cache coherency scheme by adding 
the following cycles: 


RAi - Read Address with Invalidate 
WA - Write Address 
IA - Invalidate Address Cycle 


In addition, two status lines were added to the bus to
support the protocol. The first, SHARED, indicates that
an RA cycle on the bus hit a block that exists in another
cache. This lets a requester know whether to install a new
block as PRIVATE or SHARED. The second, OWNED,
indicates that an RA or RAi cycle on the bus hit a block
that is held MODIFIED by another cache. This lets the
memory subsystems know that a cache will respond to the
request.


The scheme, in general, works as follows:


READ_HIT    No bus activity is required and requested
            data is supplied to the processor.

READ_MISS   An RA type cycle is issued on the bus. If
            any cache has a copy of the block of data
            in PRIVATE or SHARED state it
            changes its state to SHARED, and asserts
            the SHARED line on the backplane. If
            any cache has the data in MODIFIED
            state it asserts OWNED, responds to the
            request and changes its local state to
            INVALID. The state could have been
            changed to SHARED instead of
            INVALID but our implementation does
            not allow this. The memory subsystem
            observes this transaction, noting the
            assertion of the OWNED signal, and takes a
            copy of the data as it is being passed from
            one cache to the next (called implied
            copyback operation). This allows the cache to
            relinquish ownership. If no cache signals
            ownership then the memory responds to
            the request with its copy of the requested
            block. The receiving processor sets its
            tags to PRIVATE, if SHARED was not
            asserted, or SHARED otherwise.

WRITE_HIT   If the block is in MODIFIED state then
            this implies that this cache already owns
            the block and can complete the write.
            No bus activity is necessary. If the block
            is in the PRIVATE state, then the cache
            changes the state to MODIFIED and
            completes the write. If the block is in the
            SHARED state then the cache issues an
            IA cycle on the bus, causing all other
            caches to invalidate their copies (i.e., a write
            invalidate operation), and changes its state
            to MODIFIED.

WRITE_MISS  An RAi cycle is issued on the bus to
            obtain the current copy of the block and
            to signal all other caches to invalidate
            their copy. If any cache has the copy of
            the block in MODIFIED state then it
            responds to the request. Any cache which
            holds the block in PRIVATE or
            SHARED state invalidates its copy. If no
            cache holds the block MODIFIED then
            memory will respond to the request. The
            receiving cache installs the block as
            MODIFIED and completes the write.


I/O devices do not participate in the caching protocol
and therefore can issue writes to blocks that caches hold
MODIFIED. These WAi cycles are absorbed by the
caches which own the block being written.
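
The four cases above can be exercised with a small state-only simulation. This is a sketch, not the Symmetry implementation: it tracks block states and the address cycles placed on the bus, and omits data movement, memory, replacement and I/O writes. The class and method names are illustrative.

```python
INVALID, PRIVATE, SHARED, MODIFIED = "INVALID", "PRIVATE", "SHARED", "MODIFIED"


class Cache:
    def __init__(self):
        self.state = {}                 # block address -> state

    def get(self, block):
        return self.state.get(block, INVALID)


class Bus:
    def __init__(self, n_caches):
        self.caches = [Cache() for _ in range(n_caches)]
        self.cycles = []                # log of (cycle type, requester, block)

    def _others(self, requester):
        return [c for i, c in enumerate(self.caches) if i != requester]

    def read(self, requester, block):
        me = self.caches[requester]
        if me.get(block) != INVALID:
            return                                   # READ_HIT
        self.cycles.append(("RA", requester, block))   # READ_MISS
        shared_line = False
        for other in self._others(requester):
            state = other.get(block)
            if state in (PRIVATE, SHARED):
                other.state[block] = SHARED          # asserts SHARED
                shared_line = True
            elif state == MODIFIED:
                other.state[block] = INVALID         # asserts OWNED, responds
        me.state[block] = SHARED if shared_line else PRIVATE

    def write(self, requester, block):
        me = self.caches[requester]
        state = me.get(block)
        if state == MODIFIED:
            return                                   # WRITE_HIT, already owner
        if state == PRIVATE:
            me.state[block] = MODIFIED               # WRITE_HIT, no bus cycle
            return
        # WRITE_HIT on SHARED issues IA; WRITE_MISS issues RAi.
        kind = "IA" if state == SHARED else "RAi"
        self.cycles.append((kind, requester, block))
        for other in self._others(requester):
            other.state[block] = INVALID
        me.state[block] = MODIFIED


bus = Bus(2)
bus.read(0, 0x40)     # cache 0 installs the block PRIVATE
bus.read(1, 0x40)     # both copies become SHARED
bus.write(1, 0x40)    # IA cycle: cache 0 invalidated, cache 1 MODIFIED
bus.read(0, 0x40)     # owner responds and goes INVALID; cache 0 PRIVATE
print([kind for kind, _, _ in bus.cycles])
```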


6. Implementation 


The Symmetry system (Figure 1) consists of processor
subsystems, memory subsystems, disk controller(s),
SCED(s), and Multibus adapter(s). The processor
subsystems and memory subsystems are implemented to
support the new copyback and bus protocols.


A processor subsystem (Figure 2) consists of an Intel 
80386/80387 processor and floating point unit pair, an 
optional Weitek floating point subsystem, Cache Memory 
Controllers (CMCs), Bus Interface Controller (BIC) and 
Bus Data Path (BDP) devices, System Link and Interrupt 
Controller (SLIC) [BKT87], memory chips for cache 
address and data fields, some address decoding logic and 
bus transceivers. Two such systems are implemented per
board, and they are identical in all respects except they 
share a BIC. 


The cache coherency and bus protocols form the key
part of the Symmetry system and are implemented across
three VLSI devices: CMC, BIC and BDP. These devices
are all implemented in 1.2 micron CMOS technology; the
CMC and BIC are implemented in gate arrays while the
BDP is implemented in a standard cell array.


The CMC has two modes, master and slave, which
allow several of them to be cascaded to support set
associative cache organization. In the current release two
CMCs are used to support a 64K byte cache. The CMC is
software configurable to vary block and transfer sizes and
support a variety of configurations. The CMC communicates
to the BIC when it needs access to the bus. It acts as an
initiator of requests when it needs to service cache misses
and as a responder to requests for accesses to blocks it
holds MODIFIED (owned accesses). Such owned request
addresses are queued inside the BDP. The BDP also
queues addresses from bus transactions which cause
invalidates. The BDP generates owned and invalidate
requests to the CMC accordingly. The CMC is able to supply
data to the 80386 in pipeline mode with zero wait states on
cache hits.

The CMC has address, address tag and state tag
inputs to allow comparison of address tags and checking of
the state for each block of data. Two sets of tags are
provided, processor side and bus side, to allow concurrent
access. The CMC accesses the bus side tags to perform
bus-watching, when it needs to interrogate bus side state
tags or when it installs a block. The CMC interrogates the
bus side tags when it needs to differentiate between the
PRIVATE and SHARED states [Gif87]. The distinction is
not maintained in the processor tags because of the
difficulty of atomically changing states across an
asynchronous boundary.


The BIC contains two channels which handle both ini- 
tiator and responder functions. Each processor subsystem 
on a board uses one of the channels. The memory
subsystems also make use of the BIC but use only the
responder functions of one channel. The BIC contains logic
to arbitrate between the channels, make requests onto the bus,
respond to owned and invalidate transactions, load/unload 
the BDP queues and maintain the state of read, write and 
IO pipes. The BIC supports the bus watching or snooping 
function to maintain coherency across the system. Follow- 
ing an address cycle on the bus, the BIC informs the CMC 
of any necessary operations to perform on the bus-side 
tags. The BIC also receives hit/miss information from the 
CMC and uses this to load the appropriate BDP queues. 


The BDP contains 5 queues, an OWNED REQUEST 
address queue, an INVALIDATE address queue, a 
READ RESPONSE data queue, a WRITE DATA queue, 
and an OUTPUT DATA queue. The OWNED 
REQUEST queue contains the addresses of cycles which 
hit modified blocks in this processor’s cache. The 
INVALIDATE queue contains addresses which cause 
cache blocks to be invalidated. Whenever the OWNED or 
INVALIDATE queue is loaded an owned or invalidate 
request is made to the CMC. The READ RESPONSE
queue holds the response of read requests generated by
this cache’s misses. The WRITE DATA queue holds the
data associated with write addresses in the OWNED
REQUEST queue. The OUTPUT DATA queue holds the
data associated with both owned read responses from the
cache, and write requests (i.e., copybacks) from the cache.
Just like the CMC, both the BDP and the BIC are also
software configurable to handle different block and transfer
sizes.


As an example, a transaction involving a RAI cycle 
on the bus is handled as follows. The cycle after the 
address is on the bus, the address is used to search the bus 
side tags. If a match is detected the state tags for the block 
are changed to INVALID. In addition, the CMC reports 
the result of the lookup to the BIC. The BIC uses the 
result of the lookup to generate load strobes for the two 
address input queues in the BDP. The BDP latches the 
address while it is on the bus, and can load it into either 
queue in the following cycle. If the state is PRIVATE or 
SHARED the address is loaded into the INVALIDATE 
queue. If the state is reported to be MODIFIED, the 
address is loaded into the OWNED REQUEST queue 
and, during the second cycle after the address, the BIC 
will assert OWNED on the bus. The memory controller 
recognizes the OWNED signal and aborts its processing of 
the request. Once a queue is loaded the corresponding
queue empty flag is deasserted to inform the CMC that the
queue requires service. These owned or invalidate requests 
to the CMC normally have higher priority than processor 
requests. However, if the CMC has made a request to the 
bus and is waiting for a response of its own it may ignore 
the request. To avoid deadlock and maintain data integrity 
the CMC will service the OWNED REQUEST queue if 
and only if the request in the OWNED queue appeared on 
the system bus before the request from this CMC. An 
owned RAI is serviced by transferring the data from the
cache into the BDP output queue, and changing the
processor-side state tags to INVALID. The CMC then
signals the BIC to respond to the request. 
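The RAI handling described above can be summarized as a small decision procedure. The following Python sketch is only an illustration of that description, not the Symmetry hardware logic: the class and method names are hypothetical, and timing (which bus cycle each step occurs in) is ignored.

```python
from collections import deque
from enum import Enum


class State(Enum):
    INVALID = 0
    SHARED = 1
    PRIVATE = 2
    MODIFIED = 3


class SnoopingCache:
    """Toy model of one cache's bus-side tags and the two BDP address queues."""

    def __init__(self):
        self.bus_tags = {}           # block address -> State (bus-side copy)
        self.invalidate_q = deque()  # INVALIDATE address queue
        self.owned_q = deque()       # OWNED REQUEST address queue

    def snoop_rai(self, addr):
        """Handle an RAI address cycle seen on the bus.

        Returns True when this cache must assert OWNED on the bus.
        """
        state = self.bus_tags.get(addr, State.INVALID)
        if state is State.INVALID:
            return False                    # no match: nothing to do
        self.bus_tags[addr] = State.INVALID  # a match always invalidates
        if state is State.MODIFIED:
            self.owned_q.append(addr)       # this cache must supply the data
            return True
        self.invalidate_q.append(addr)      # PRIVATE or SHARED
        return False
```

A non-empty queue in this model plays the role of the deasserted queue-empty flag that tells the CMC the queue needs service.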


Invalidate address (IA) cycles are similarly handled 
using the INVALIDATE queue in the BDP. If the opera- 
tion is just an invalidate no data is transferred. The CMC 
just changes the processor-side tags to INVALID and pops
the queue. 


The SLIC on the Symmetry system is used for 
configuring the system (setting up registers in the VLSI) 
and for handling interrupts in the system. The gates used 
by the kernel for mutual exclusion in the Balance system 
are no longer used because of the faster transparent paral- 
lel locks [Gif]. Unlike the Balance system both the kernel 
and users use the same type of locks for mutual exclusion. 


The memory subsystem can currently support up to
240 Mbytes of memory on 6 controller/expansion pairs.
The controllers support two-way interleaving which allows
the subsystem to support the 53 Mbyte/sec bandwidth of
the bus. The BIC and BDP are also used on the memory
controller to support the bus interface and data path functions.
The memory controllers can respond to both wide and narrow
transactions of from 1 to 16 bytes, and are fully compatible
with the Balance environment.


The memory controllers perform two special func- 
tions that are necessary to support the copyback protocol. 
The first is the recognition of caches claiming ownership of 
blocks. When a cache asserts OWNED in response to a 
request on the bus, the memory must avoid responding to 
the request. Often the controller is near completion of the 
request before it recognizes ownership and must abort the 
request. In cases where the controller was busy when the
request arrived it can merely discard the request from the
BDP’s queue and continue with the following request. 


The second function is the implied copyback opera- 
tion. If an RA request for a full cache block is claimed as 
owned by a cache the memory controller must watch for 
the response from the cache and grab the data as it is 
being passed over the bus. The data is written back into 
memory so that the copy in main memory is up to date and 
the cache can give up ownership. The BIC monitors the 
bus for the OWNED signal and maintains a set of flags 
that the controller logic examines as it pulls addresses out 
of the BDP. In addition the BIC determines which read 
response is associated with the owned RA and loads the 
data into the BDP’s queue. 
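The two special memory-controller functions can be pictured with a short Python sketch. It is a simplified model written for illustration only: the class, its fields and the boolean arguments are assumptions, and the real controller works with BDP queues and BIC flags rather than method calls.

```python
class MemoryController:
    """Toy model of OWNED recognition and the implied copyback."""

    def __init__(self):
        self.memory = {}               # block address -> data
        self.pending_copyback = set()  # owned full-block reads being watched

    def on_request(self, addr, owned, full_block_read):
        """Return data to drive on the bus, or None if a cache owns the block."""
        if not owned:
            return self.memory.get(addr, 0)
        if full_block_read:
            # Implied copyback: watch for the owning cache's response.
            self.pending_copyback.add(addr)
        return None                    # abort or discard the request

    def on_read_response(self, addr, data):
        """Capture data passing over the bus for a watched address."""
        if addr in self.pending_copyback:
            self.memory[addr] = data   # main memory is up to date again
            self.pending_copyback.discard(addr)
```

After `on_read_response` stores the block, the copy in main memory matches the cache, which is what lets the cache give up ownership.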


7. Performance Monitoring 


The performance of the Balance system was meas- 
ured using a hardware monitor, DYNAPROBE [Com77], 
which can measure different events for a given period.
The events can be set up using a logic patch board. This
proved capable but was severely limited by the number of
probe points, which is a real limitation for a multiproces-
sor system. Thus, on Symmetry a decision was made to
incorporate the performance monitoring hooks into 
hardware which can be accessed by special system 
software. The hardware includes counters, masks and 
multiplexing logic. The mask can be set and appropriate 
events of interest selected before the counters are started. 
The counters can be stopped and read by system software
via the SLIC chip.
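The select, start, stop and read sequence can be mimicked in software. The Python sketch below is a loose analogy under stated assumptions: the event names, the one-bit-per-event mask layout and the method names are all hypothetical and do not describe the actual Symmetry registers.

```python
class EventCounters:
    """Software analogy of masked hardware event counters."""

    def __init__(self, event_names):
        # One mask bit per event (an assumed layout).
        self.bit = {name: 1 << i for i, name in enumerate(event_names)}
        self.counts = {name: 0 for name in event_names}
        self.mask = 0
        self.running = False

    def select(self, *names):
        """Set the mask; only allowed while the counters are stopped."""
        if self.running:
            raise RuntimeError("stop the counters before changing the mask")
        self.mask = 0
        for name in names:
            self.mask |= self.bit[name]

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def observe(self, name):
        """Called once per event occurrence; counted only if selected."""
        if self.running and self.mask & self.bit[name]:
            self.counts[name] += 1

    def read(self):
        return dict(self.counts)
```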

The types of events that can be measured include all 
types of accesses to the CMC by the processor, accesses 
from the bus to the CMC (i.e., owned and invalidate opera-
tions), and state changes. This allows us to detect the
accesses to shared blocks, etc. Other events that can be 
measured include the different types of bus cycles and 
other aspects of the bus protocol. These features give us a
unique opportunity to study this architecture and its
behavior under different applications.


8. Performance 


The performance of a multiprocessor needs to be
observed in the following domains:


Single Thread Performance 
Parallel Program Performance 
Multi Stream Performance 


8.1. Single Thread Performance 


In Figure 3 we show the performance of two small
integer benchmarks that measure the processor perfor-
mance across the Balance and different Symmetry
configurations. The table shows that the project goal of
increasing processor performance to 4-5 times that of a
Balance processor has been achieved. Figure 4 shows the
floating point performance of the Symmetry processor
relative to the Balance processor. Both single precision
and double precision performance have increased
significantly for the Linpack and Whetstone benchmarks. The
narrow bus is a 32 bit bus while the wide bus is
a 64 bit bus.


For small programs such as the Dhrystone and Dhamp-
stone benchmarks, in a write-through system, the cache size
and bus width have an effect on performance. Increasing the
cache size to two sets (64K bytes) increases the perfor-
mance of these benchmarks by 5%, and increasing the bus
width to 64 bits gives a 3% increase in performance. The
copyback system does not show such increases; increasing
the cache size and bus width had only a small effect for these
programs (1%). There is less bus activity for copyback
caches since these small programs are now running entirely
out of the cache.


The 64K byte cache yielded a 99% hit-rate for several 
integer and floating point benchmarks for both write- 
through and copyback protocols. This is a significant 
improvement over the Balance 8Kbyte cache which gave 
95% hit-rate for integer benchmarks and 85% for floating- 
point benchmarks [Tha87]. 
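Hit-rates this close to 100% are easier to compare as miss rates, since each miss generates bus traffic. The short Python calculation below uses only the hit-rates quoted above; the function name is ours.

```python
def miss_ratio_improvement(old_hit_rate, new_hit_rate):
    """Factor by which the miss rate shrinks when the hit rate improves."""
    return (1.0 - old_hit_rate) / (1.0 - new_hit_rate)


# Balance 8 Kbyte cache versus Symmetry 64 Kbyte cache.
integer_gain = miss_ratio_improvement(0.95, 0.99)
floating_gain = miss_ratio_improvement(0.85, 0.99)
print(round(integer_gain, 1), round(floating_gain, 1))
```

Going from 95% to 99% cuts integer misses by about a factor of 5, and going from 85% to 99% cuts floating-point misses by about a factor of 15.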


8.2. Parallel Program Performance 


Several parallel benchmarks and applications were
run to observe the bus utilization of the Symmetry system
with write-through and copyback caches. The benchmarks
and applications include:


Parallel Linpack Benchmark (LINPACK) 
2D Monte Carlo Simulation (STMC2D) 
Butterfly Switch Simulator (SIM)
Ray Tracing (SMOKE)


The bus utilization (Figure 5) shows a significant
improvement for the Symmetry system with copyback
caches over that with write-through caches. The bus
approaches saturation much faster in a system with a
write-through cache and appears to reach a limit with
fewer than 16 processors. The reason bus utilization
increases rapidly in such a system is the number of
writes. Active write sharing in parallel applications does
not seem to affect the bus utilization adversely since it is
insignificant (Figure 6). The IA/RAI cycles indicate shar-
ing on the bus: the RAI cycles reflect writes to invalid
blocks, and the IA cycles reflect writes to shared blocks. The
copyback policy removes most writes from the bus and thus
decreases the demand on the bus.


Note that all the results are from a system with a 64 Kbyte,
2-set cache and wide bus in a 16 MHz environment.
The benchmarks are not tuned and were run as presented 
to us. There may be a potential to tune the algorithms and 
increase their performance. 


8.3. Multi Stream Performance 


At present, the parallel applications represent explicit 
attempts at using parallelism in speeding up an application. 
However, as Sequent and other manufacturers of similar 
multiprocessors have found, a great advantage can be 
taken of natural parallelism in a multi-user timesharing 
environment to deliver superior performance. The 
processes are scheduled independently across the available
processors, thus providing a responsive and available
environment. There is little sharing between the proces-
sors, and hence little contention exists for resources. This
is reflected by both a high cache hit-rate and low bus utiliza-
tion. Since the processes have longer time-slices than on a
uniprocessor, the large cache provides a high hit-rate
(99%) because there are fewer context switches.


In such an environment the performance of the sys-
tem can be regarded as cumulative over the total number of
processors. One way to approximate the multi stream perfor-
mance of a system is to consider running n copies of a pro-
gram on an n processor system. Figure 7 shows the roll-off
in runtime when running 12 copies of a suite of programs
on a 12 processor write-through system and 28 copies of
the same suite on a 28 processor copyback system. The
suite of programs includes several benchmarks (Dhry-
stone, Sieve, Linpack, Whetstone, Puzzle, Sort), an Nroff
application and some scientific applications (e.g., Butterfly,
Gauss, Barsim). The roll-off for most of the benchmarks
and programs starts occurring early in a write-through sys-
tem. A gentle roll-off occurs for only one program
(Butterfly) in a copyback system. The roll-offs in the
write-through systems are associated with the write
traffic generated on the bus. The Butterfly roll-off
improves with the use of memory interleaving (not shown).
Note that this is only an approximation of system perfor-
mance in such an environment.


9. Conclusion 


Symmetry represents one of the first bus-based
shared-memory multiprocessors to incorporate a copyback
cache with a split transaction bus. Embedded hardware
incorporates performance monitoring hooks to monitor the
dynamic behavior of the system. The ability to switch
between write-through and copyback protocols has allowed
us to observe the behavior of the two protocols for several
parallel benchmarks and applications. Results show that a
copyback cache has allowed us to incorporate much faster
processors in a bus-based shared-memory multiprocessor.
Results confirm that there is little active write sharing in
parallel applications and justify our choice of cache
coherency protocols.


Figure 1: Block Diagram of Symmetry System


Figure 7: MultiStream Roll-Offs, Write-Thru vs Copyback System


Figure 5: Bus Utilization for Parallel Applications


Two Parallel Processing Aspects of
The CRAY Y-MP Computer System 


Steve Reinhardt
Mendota Heights, MN 55409


ABSTRACT 


The CRAY Y-MP computer system is a paral- 
lel processing supercomputer. The architecture of 
the Y-MP is an evolutionary step from the CRAY 
X-MP series of computers; in many ways, the Y- 
MP is a "bigger X-MP". The Y-MP system pro- 
vides eight CPUs compared to four for the X-MP. 
On the Y-MP, the CPUs share 32 million words of 
central memory, compared to a maximum of 16 
million words on the X-MP. The clock period of 
each CPU decreases from 8.5 nanoseconds on 
recent X-MP models to 6.0 nanoseconds on the Y- 
MP. 


Many of the features of the X-MP which allow it to 
run common programs fast seem to be features 
which are particularly hard to scale to systems with 
more CPUs. In particular, the design and imple- 
mentation of the shared registers and multiple ports 
from each CPU to the central memory require care 
to preserve high performance as the number of 
CPUs grows. We investigate how the central 
memory responds to different levels of memory 
traffic and how the shared register access times 
affect the size of critical regions in common usage. 
In short, we look at how well the X-MP architec- 
ture scales from four processors to eight. 


1. Introduction 


This paper looks at the evolution of the CRAY-1 to the 
CRAY X-MP to the CRAY Y-MP to see how the architec- 
tural decisions made during that evolution affect the parallel 
processing capabilities of the Y-MP. We begin with a short 
historical perspective and describe the areas of the CRAY Y- 
MP computer system which are of particular importance to 
parallel processing. Then we look at the central memory sys- 
tem and how the X-MP and Y-MP respond to frequently 
occurring memory access patterns. Next we see how the X- 
MP and Y-MP shared register reference times affect the size 
(time of execution) of multiprocessing synchronization, which 
affects the size of critical sections of code which may be 
profitably processed in parallel. The emphasis is on how the 
X-MP/Y-MP architecture affects the ability of one program to 
use the whole machine effectively. 


2. The CRAY-1 and CRAY X-MP Computer Systems 


The first CRAY-1 computer system was delivered in
1976 [1]. The central processing unit (CPU) is of register-to-
register type. The clock period is 12.5 nanoseconds. Address
registers are 24 bits wide and scalar registers are 64 bits wide.
No virtual memory support is provided. Central memory pro-
vides one port to the CPU.


The CRAY X-MP represented the first move by Cray
Research into parallel processing and began a concentration on 
parallel processing which continues today. The Cray approach 
to parallel processing is to make the fastest general-purpose 
scientific processor possible and then to put together as many 
of those processors as possible. Architecturally this tends to be 
an evolutionary, incremental approach. 


The basic CPU architecture of the CRAY X-MP is the
same as that of the CRAY-1. New features were three ports
to central memory and flexible chaining. The first CRAY X-
MP computer system (1982) contained two CPUs [2]. When one-
and four-CPU models (1984) were introduced, hardware
gather/scatter was added. Multi-CPU models provide clusters
of shared registers, each of which includes a set of one-bit
semaphores and shared address and scalar registers. Early X-
MP models had a 9.5 nanosecond clock period; all models
produced since 1986 have an 8.5 nanosecond clock period.


3. The CRAY Y-MP Computer System 


Each CPU of the CRAY Y-MP computer system is 
nearly identical to that of the CRAY X-MP. (In fact, X-MP 
binaries can be run in an X-MP compatibility mode.) To sup- 
port larger address spaces, address registers on the Y-MP are 
32 bits wide. The 8 CPUs run with a 6.0 nanosecond clock period.


The central memory is 32 million words arranged in 256 
banks. The banks are interleaved on the low-order bits of the 
address; thus consecutively addressed words reside in separate 
banks. Each word is 64 data bits plus 8 check bits for 
SECDED. The bank cycle time is 5 clock periods (that is, 
each bank can only be referenced every 5 clock periods). 
Each CPU provides four ports to memory: two read ports, a 
write port, and an I/O port. The two read ports and the write 
port are under direct control of the CPU. The I/O port may 
be in use independently of the state of the other three ports. 


Block memory operations can use all three CPU ports simul- 
taneously. Scalar memory operations wait until all block 
transfers are quiet to ensure correct sequences between block 
memory operations and scalar operations within a CPU. 
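The effect of low-order interleaving and the 5 clock period bank cycle time can be seen with a small model of a single port. The Python sketch below is a deliberately simplified illustration: it models only one reference stream, and it ignores sections, subsections, the other ports and the other CPUs. The function names are ours.

```python
NUM_BANKS = 256   # banks in central memory
BANK_CYCLE = 5    # clock periods before a bank can be referenced again


def bank_of(word_address):
    """Low-order interleaving: consecutive words fall in consecutive banks."""
    return word_address % NUM_BANKS


def clocks_for_stream(stride, count, start=0):
    """Clock periods one port needs to issue `count` strided references."""
    free_at = [0] * NUM_BANKS  # first clock at which each bank is idle again
    clock = 0
    for i in range(count):
        bank = bank_of(start + i * stride)
        clock = max(clock, free_at[bank])  # stall until the bank is idle
        free_at[bank] = clock + BANK_CYCLE
        clock += 1                         # at most one reference per clock
    return clock
```

In this model a stride-one stream never stalls, because a bank is revisited only after 256 clocks. A stride of 256 words hits the same bank every time, so each reference after the first waits out the full bank cycle time.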


A multiprocessing program on the CRAY Y-MP syn- 
chronizes its CPUs via a set of shared registers (a cluster) 
which includes 32 semaphore bits, eight 32-bit shared B (SB) 
registers, and eight 64-bit shared T (ST) registers. The sema- 
phore bits have atomic test-and-set, unconditional set, and 
clear instructions. The SB and ST registers may be read and 
written. Access to the SB and ST registers in a parallel pro- 
gram must be controlled with the semaphores. 
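The rule that SB and ST accesses must be controlled with semaphores can be illustrated in software. In the Python sketch below a lock stands in for the hardware atomicity of test-and-set, and a waiting CPU is modeled as a spin loop. Both are assumptions of the sketch, as are the class and function names; the actual hardware behavior when the bit is already set is not described here.

```python
import threading


class Cluster:
    """Toy model of one shared register cluster."""

    def __init__(self):
        self._atomic = threading.Lock()  # stands in for hardware atomicity
        self.sm = [False] * 32           # semaphore bits
        self.sb = [0] * 8                # shared B registers
        self.st = [0] * 8                # shared T registers

    def test_and_set(self, n):
        """Atomically set semaphore n and return its previous value."""
        with self._atomic:
            old = self.sm[n]
            self.sm[n] = True
            return old

    def clear(self, n):
        with self._atomic:
            self.sm[n] = False


def add_to_shared(cluster, reg, amount, sem=0):
    """Update an ST register inside a semaphore-guarded critical region."""
    while cluster.test_and_set(sem):  # wait until the bit was found clear
        pass
    try:
        cluster.st[reg] += amount     # only one thread is ever here
    finally:
        cluster.clear(sem)
```

Without the semaphore, the read-modify-write of `cluster.st[reg]` from several CPUs could interleave and lose updates.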


Physically, the Y-MP is a compact machine. The logic 
chassis consists of 41 modules: 8 CPU modules, 32 memory 
modules, and 1 clock module. Each module is 11" x 21.2" x 
1.4". Each module lies flat. The clock module is on the bot- 
tom, then 16 of the memory modules, the 8 CPU modules, 
and the other 16 memory modules. The footprint of the main- 
frame (including power supplies) is 79" long, 32" wide, and 
the machine is 76" tall. 


4. Parallel Processing Scalability 


In parallel processing, one major distinguishing point 
among machines is whether the CPUs share a central memory 
or whether most of the system memory is local to the CPUs. 
Shared-memory multiprocessors are generally considered 
easier to program; private-memory multiprocessors are con-
sidered easier to scale [3]. Because the X-MP and Y-MP are
shared-memory multiprocessors, we are very interested in sca- 
lability of that type of machine. In the CRAY X-MP/Y-MP 
line, the speed of two particular features is crucial to the paral- 
lel processing ability of the machine. The central memory 
must provide high bandwidth to all processors and minimize 
the effects of contention. The shared register clusters must 
synchronize CPUs quickly and allow fast communication with 
low overhead. 


4.1. Main Memory 


The CRAY-1 provides one port between the CPU and
memory. A program can slow down due to memory conflicts
because of I/O references or because the program strides 
through memory and re-references banks before the bank cycle 
time has expired. A program which uses only stride-one 
references will not have bank conflicts. The CRAY X-MP 
and Y-MP provide three ports between each CPU and 
memory. Memory banks are grouped into sections (both X- 
MP and Y-MP) and subsections (Y-MP only). In addition to 
the possible memory conflicts in a CRAY-1, additional 
conflicts may arise from two memory operations from the 
same CPU competing with each other. In a multiple-processor 
X-MP/Y-MP system, additional conflicts arise from separate 
CPUs competing with each other [4]. The CRAY Y-MP has no
architectural difference from the CRAY X-MP in terms of 
CPU to memory connections, so the differences in memory 
contention should be attributable to the different number of 
processors, number of banks, bank cycle times, number of 
sections, etc. 


To illustrate the effects of memory contention, we have
chosen one algorithm, matrix multiplication, and coded it in
three different ways, each with its own amount of memory 
contention and bandwidth requirement. Matrix multiplication 
was chosen because it is important for many applications, 
well-known, simple to code, and highly parallel. Each algo- 
rithm does N-1 floating-point adds and N floating-point multi- 
plies for each element of an NxN matrix. The algorithms run 
at different speeds even on a single CPU; what we want to 
emphasize here is the speedup of each algorithm relative to 
itself. Each algorithm is parallelized by the use of Cray’s
microtasking software [5, 6].
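The operation count above fixes the total work, so MFLOPS and speedup follow directly from measured run times. The Python helpers below are a minimal sketch of that arithmetic; the function names are ours and no timings from the paper are assumed.

```python
def matmul_flops(n):
    """N-1 adds plus N multiplies for each of the N*N result elements."""
    return n * n * ((n - 1) + n)


def mflops(n, seconds):
    """Millions of floating-point operations per second for one run."""
    return matmul_flops(n) / seconds / 1e6


def speedup(time_one_cpu, time_p_cpus):
    """Speedup of an algorithm relative to itself on one CPU."""
    return time_one_cpu / time_p_cpus
```

For the 1000x1000 case, `matmul_flops(1000)` is 1,999,000,000 operations, so a run that took 10 seconds would correspond to roughly 200 MFLOPS.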


The first algorithm (VXS) is a FORTRAN code which 
does 3N memory references for each element of the resulting 
matrix. Because VXS uses three memory ports per clock 
period per CPU in its vector portions, it is referred to as a 3-
port algorithm.


The second algorithm (SDOT) is a FORTRAN code 
which does 2N memory references for each element of the 
resulting matrix; it is a 2-port code. Both VXS and SDOT 
are Basic Linear Algebra Subroutines (BLAS) Level 1 rou-
tines in the LINPACK naming scheme [7].


The third algorithm (MXV) is a single FORTRAN loop 
which calls the assembler-coded mxv library routine to multi- 
ply a matrix times a vector. The mxv routine does N memory 
references for each element of the resulting matrix; it is a 1- 
port code. The mxv routine is a BLAS Level 2 LINPACK 
routine. 
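The 3N, 2N and N reference counts correspond to three ways of organizing the same arithmetic. The Python sketch below shows one plausible reading of them: a column update that loads and stores the result (3 references per step), a dot product (2 references per step), and a matrix-vector form that keeps the result column in registers (1 reference per step). The loop forms and function names are assumptions, since the source of the three routines is not given, and scalar loads and final stores are not counted.

```python
def matmul_update(a, b, n):
    """Column update c[:, j] += a[:, k] * b[k][j]: 3 references per step."""
    c = [[0.0] * n for _ in range(n)]
    refs = 0
    for j in range(n):
        for k in range(n):
            s = b[k][j]                          # scalar held in a register
            for i in range(n):
                c[i][j] = c[i][j] + a[i][k] * s  # load c, load a, store c
                refs += 3
    return c, refs


def matmul_dot(a, b, n):
    """Dot product of a row of a with a column of b: 2 references per step."""
    c = [[0.0] * n for _ in range(n)]
    refs = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]         # load a, load b
                refs += 2
            c[i][j] = acc
    return c, refs


def matmul_mxv(a, b, n):
    """Matrix times vector with the result kept in registers: 1 reference."""
    c = [[0.0] * n for _ in range(n)]
    refs = 0
    for j in range(n):
        col = [0.0] * n                          # stands in for vector registers
        for k in range(n):
            s = b[k][j]
            for i in range(n):
                col[i] += a[i][k] * s            # load a only
                refs += 1
        for i in range(n):
            c[i][j] = col[i]
    return c, refs
```

All three produce the same matrix; only the number of memory references per result element differs, which is what separates the 3-port, 2-port and 1-port behavior.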


Table 1 has both millions of floating-point operations per second (MFLOPS) and
speedup numbers for the above algorithms for a 1000x1000 
matrix on the CRAY X-MP and CRAY Y-MP systems. For 
this size problem, multiprocessing overheads and granularity 
of work are not issues, and total performance is related pri- 
marily to memory conflicts. MXV requires one port per clock 
period and scales linearly, since its memory traffic requirement 
is well within the bandwidth of the machine. SDOT scales 
moderately well on an X-MP; the higher total memory 
bandwidth of the Y-MP gives a better 4-CPU speedup on the 
Y-MP than the X-MP. VXS requires the full memory
bandwidth (3 ports) of each CPU; on either machine speed-
ups are well below linear. Again, the Y-MP gives a better
speed-up for a given number of CPUs. (These runs were
made on the prototype hardware which at the time of these
tests allowed VXS to run on up to 7 processors. Also, at the
time of the tests, the clock period of the prototype was 6.33 ns
rather than the 6.0 ns of the production systems.)


[Table 1: MFLOPS and speedups for VXS, SDOT and MXV; values not recoverable]


Table 1. Matrix Multiply Speedups


From these results, one can conclude that MXV is the 
best of these algorithms for matrix multiplication, and that in 
general the smaller the memory load the better the speed-up.
This is not surprising. The speed-ups for different memory
loads can also help a programmer predict how well another
algorithm (with some known memory load) will scale (subject 
to degree of parallelism). A machine-independent conclusion 
is that an important measurement of a parallel machine may be 
not only how much memory load can it support per CPU, but 
also how much memory load can it support and still achieve 
linear speed-ups for highly parallel code. 


4.2. Shared Registers 


The only architectural difference between X-MP shared 
registers and Y-MP shared registers is that the Y-MP shared 
B-registers are 32 bits to match the Y-MP A-registers. X-MP 
A-registers and shared B-registers are 24 bits. The Y-MP has 
9 clusters; the p-CPU X-MP has p+1.


The time to execute instructions which reference the 
shared registers has lengthened from the X-MP to the Y-MP. 
One reason for the slowdown is physical proximity. In the 2- 
processor CRAY X-MP, each CPU comprises about 140 dou- 
ble modules in half of four columns. The CPUs are situated 
one on top of the other. For speed, the shared registers are 
located near the boundary between the CPUs. In the 4- 
processor CRAY X-MP, each CPU requires the same space as 
the 2-processor. The CPUs are situated as four corners of a 
checkerboard, touching at the center of the 8 columns they 
occupy. The shared registers are located close to the point 
where all four CPUs meet. In the CRAY Y-MP, each CPU 
consists of one double module. The CPUs lie flat, one on top 
of another. The shared registers are distributed across all 
eight CPU modules. Thus the furthest two CPUs are 7 module 
slots away from each other. Also, with more CPUs, the 
shared register arbitration logic has grown. (Shared register 
access times slowed down from the 2-processor X-MP to the 
4-processor X-MP for this reason). 


Because of these factors, the access time for the shared 
registers and semaphores has increased. On the 2-processor 
X-MP, to test-and-set a cleared semaphore takes one clock 
period (CP). To read or write a shared register takes 1 CP 
(assuming no other CPUs are accessing shared registers). On 
the 4-processor X-MP, to test-and-set a cleared semaphore 
takes 1 CP. Reading/writing a shared register takes 3 CP for 
instruction issue (again assuming no conflicts). On the Y-MP, 
to test-and-set a cleared semaphore takes 3 CP to issue. 
Reading/writing a shared register takes 3 CP to issue (again 
assuming no conflicts). However, on the Y-MP, once the 
instruction issues, the cluster is blocked for another 3-7 CP, 
depending on the instruction, and a second cluster reference 
will hold for an extra 3-7 CP. A read from a shared register 
waits 7 CP after issue before the data is ready to be used. In 
all X-MP models and the Y-MP, shared register references 
are slowed down by contention with other CPUs by an 
amount proportional to the number of CPUs competing for 
access. 


Determining the effect of the slower shared-register
accesses on actual programs [8] is difficult because generally
they have large enough granularity that time spent in critical
regions is small. (A critical region is a portion of code in
which only one CPU may be executing at a time.) Instead, we
look at how the execution time of the critical regions used in
microtasking has increased as the machine type and number of
CPUs change. Predictions of how that affects efficiency for
a particular algorithm are algorithm- and granularity-
dependent [9, 10]. We use a parallel program which starts a paral-
lel loop and measures how long each CPU takes to enter the
parallel region and get its first piece of work. See Figure 1 for
the source. (The initial set-up time of getting extra CPUs
from the operating system is ignored; the numbers are for the
case where all CPUs are waiting at the beginning of the criti-
cal region.) Table 2 shows the incremental times for the ith
processor to get its work after the (i-1)th processor completes.


On both the X-MP and the Y-MP, the cost for the initial 
CPU to set up the loop is larger than the times for subsequent 
CPUs. The times for subsequent CPUs are not influenced by 
the number of CPUs competing. This means that adding more 
CPUs to a program will not slow down critical region execu- 
tions by prior CPUs. 


As the access time for the shared registers grows, the
speed advantage of using them instead of memory decreases.
The advantage on the CRAY X-MP/2, for instance, is 1 CP to
read a shared register versus 14 CP to read from central
memory. On the CRAY Y-MP, the difference is 10 CP for
the shared register versus 18 CP to memory. Indeed, some
work has been done to look at the possibility of using only the
semaphores of the shared register sets and not using the SB
and ST registers, keeping all data values in memory instead.
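The shrinking advantage is easy to quantify from the figures quoted above. The Python sketch below takes the Y-MP shared register read as 10 CP (3 CP to issue plus the 7 CP wait for data) and converts clock periods to nanoseconds. The 8.5 ns X-MP clock is an assumption, since early X-MP models ran at 9.5 ns, and the names are ours.

```python
LATENCY_CP = {
    "X-MP/2": {"shared_register": 1, "memory": 14},
    "Y-MP": {"shared_register": 10, "memory": 18},
}

# Clock periods in nanoseconds (8.5 ns assumed for the X-MP/2).
CLOCK_NS = {"X-MP/2": 8.5, "Y-MP": 6.0}


def advantage(machine):
    """Return (clock periods saved, nanoseconds saved, latency ratio)."""
    lat = LATENCY_CP[machine]
    saved_cp = lat["memory"] - lat["shared_register"]
    ratio = lat["memory"] / lat["shared_register"]
    return saved_cp, saved_cp * CLOCK_NS[machine], ratio


for name in LATENCY_CP:
    print(name, advantage(name))
```

Under these assumptions a shared register read saves 13 CP on the X-MP/2 but only 8 CP on the Y-MP, and the latency ratio falls from 14 to 1.8.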


      SUBROUTINE TIMEIT
      IMPLICIT INTEGER ( A - Z )
      PARAMETER ( TRIPLEN = 100 , NUMCPUS = 8 )
      COMMON /CLOCKS/ ICPU, ICPU2, OUTCS(NUMCPUS,4)

CMIC$ GUARD
      IF (ICPU.LE.NUMCPUS) THEN         ! determine CPU id
         ICPU = ICPU + 1                ! to store timings later
         MCPU = ICPU
      ENDIF
CMIC$ END GUARD

CMIC$ DO GLOBAL
      DO 100 I = 1,TRIPLEN              ! first loop just gets all
         DO 50 JSPR = 1,10              ! CPUs local to subroutine
            X = SQRT(1.0)
   50    CONTINUE
  100 CONTINUE

      OUTCS4 = 0                        ! use the second loop copy for
      OUTCS3 = IRTC()                   ! timing, so all CPUs are local
CMIC$ DO GLOBAL
      DO 200 I = 1,TRIPLEN
         IT = IRTC()                    ! time for this CPU to enter is
         IF (OUTCS4.EQ.0) THEN          ! OUTCS4 - MIN(OUTCS(i,3))
            OUTCS4 = IT                 ! for 1 <= i <= NUMCPUS
         ENDIF
         DO 150 JSPR = 1,1000           ! waste some time
            X = SQRT(1.0)
  150    CONTINUE
  200 CONTINUE

      OUTCS(MCPU,3) = OUTCS3            ! store timings off CPU id
      OUTCS(MCPU,4) = OUTCS4
      RETURN
      END


Figure 1. Shared Register Timing Source 
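The comments in Figure 1 define each CPU's entry time as its OUTCS4 value minus the smallest OUTCS3 value over all CPUs. A possible post-processing step that turns those stored clock readings into per-processor increments is sketched below in Python. The function names and the sample readings are hypothetical; only the subtraction rule comes from the listing.

```python
def entry_times(outcs3, outcs4):
    """Clock periods from the earliest loop start to each CPU's first work."""
    start = min(outcs3)  # MIN(OUTCS(i,3)) over all CPUs
    return sorted(t - start for t in outcs4)


def incremental_times(outcs3, outcs4):
    """Time for the ith CPU to get work after the (i-1)th CPU completes."""
    entered = entry_times(outcs3, outcs4)
    return [entered[0]] + [b - a for a, b in zip(entered, entered[1:])]


# Hypothetical real-time clock readings for four CPUs.
sample_outcs3 = [1000, 1002, 1001, 1005]
sample_outcs4 = [1150, 1090, 1120, 1180]
print(incremental_times(sample_outcs3, sample_outcs4))
```

The first entry is the cost for the initial CPU to set up the loop; the remaining entries are the increments for subsequent CPUs.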


However, going to memory incurs a subtle penalty. In the 
case of a vector code, much of the time of taking the next 
piece of work (that is, executing the critical region) can be 
overlapped with vector instructions which have not yet com- 
pleted. However, if the critical regions kept key variables in 
memory instead of shared registers, the references to memory
for those variables would force a hold issue condition waiting
for the vector memory references to complete. Thus the shared 
registers function in a sense as an independent memory port. 


[Table 2: one row per processor, 1 through 8; one column per machine, including X-MP/2 and X-MP/4; values not recoverable]

Table 2. Critical Region Times 
(in clock periods) 


5. Conclusions 


We have looked at two specific aspects of the CRAY 
Y-MP which are important to using the whole machine for 
one program, and compared timings for the Y-MP to its X- 
MP predecessor. In the area of memory contention, the per- 
formance of matrix multiplication coded to use only one 
memory port per clock period per CPU scales linearly with the 
number of processors. Using three memory ports per clock 
period per CPU produces less than linear speedups. From this 
we may conclude that for parallel applications the number of 
ports from each CPU to central memory may not be as impor- 
tant as the number of ports which can be referenced from the 
same program on all CPUs and still provide linear speed-ups. 


In the area of shared registers we looked at execution 
times of the standard Cray microtasking critical regions. 
These timings have slowed down on the Y-MP, but not as 
much as the timings for the shared register instructions might 
predict. A positive result of the timings is that extra CPUs do 
not slow down the speed of the critical regions for prior 
CPUs. Thus there is no need to worry about getting too many 
CPUs into a program and hurting overall performance. The 
slowdown of the critical region highlights the difficulty of pro- 
viding very fast shared registers as the number of communi- 
cating CPUs grows. Since most 4-CPU multitasking codes 
have granularity sufficient to make synchronization overhead 
very small, the effect of the longer critical regions on real 
applications is unclear.


ABSTRACT


This paper investigates reconfiguration algorithms 
for loops and multi-dimensional grids in a hypercube 
architecture. The reconfiguration algorithms are 
invoked when a fault is detected and the original loop 
or multi-dimensional grid is no longer valid. The 
reconfiguration algorithms are able to reach an 
equivalent set of topologies within the same architec- 
ture in a distributed manner. We also propose fault- 
tolerant mapping strategies for embedding a loop or a 
multi-dimensional grid into a hypercube so that the 
resulting mapping facilitates the reconfiguration. 


1. Introduction 


Recently the problem of mapping algorithms to 
various computer architectures has received much 
attention in parallel processing [1, 2], VLSI systolic 
algorithm design [3], distributed processing and fault- 
tolerant computing [4, 5, 6, 7, 8, 9]. In parallel process- 
ing, specific algorithms are mapped into a set of proces- 
sors connected by a certain inter-connection network. 
For example, Fast Fourier Transformation and bitonic 
sorting algorithms can be mapped into perfect-shuffle, 
butterfly, hypercube, mesh-connected networks and so 
on [2, 3, 10, 11, 12, 13]. Based on algorithm transforma- 
tions, any iterative algorithm can be partitioned and 
mapped into fixed size VLSI systolic arrays [14]. 


More recently, mapping a basic inter-connection
network into a host has also been considered by many
researchers [15]. This is usually done by exploiting the
structures of both the embedded and the host networks.
For instance, approaches to map loops, trees, and
mesh-connected networks into a hypercube have been
proposed in [16, 17].

Once we have mapped a basic network into a host, it is desirable that, if the host has a fault, it can reconfigure itself so that the basic network can continue operation with minimum interruption. The reconfiguration process can be either centralized or decentralized. However, the centralized scheme has several drawbacks, e.g., the vulnerability of the global supervisor, the lack of uniformity among processors, and the tedious information collection, computation, and distribution. A decentralized scheme uses only local information but can achieve the same goal via the cooperation of processors.

Studies have been done on the design of the host for some basic networks so that the reconfiguration can be carried out. In most cases, the host is actually constructed by adding some spare processors and links into the basic network, and the most popular application is to add redundant links and processors to loop and tree networks [18, 19, 20, 21, 22, 23].

However, in many cases we do not have the luxury of building the underlying host machine; once the host machine is selected, it is fixed. This is a more realistic assumption, since most machines we use are manufactured by computer companies and, except for minor memory, CPU or I/O upgrades, they cannot be changed drastically. Since the host machine is fixed, the problem is then to find intelligent mapping algorithms to map the basic network into the host so that the reconfiguration can be carried out easily. In this paper, we discuss this problem. The host machine of interest is a hypercube and the basic networks are loops and multi-dimensional grid networks.

The hypercube is selected as the host machine because it has many desirable features [16, 24, 25, 26]:


(1) Each node has the same number of direct 
neighbors n, hence it is possible to overlap n different 
data transfers from any given node to its n neighbors to 
fully utilize the high total bandwidth of hypercube; 


(2) It has small diameter thus minimizing the com- 
munication cost; 


(3) A variety of basic graphs can be embedded in a hypercube. For example, trees, loops, multi-dimensional grids, and so on have been mapped into hypercubes; and


(4) Its homogeneity and symmetry properties make each node equivalent in fault detection and recovery, by which we can achieve reconfiguration in a homogeneous and distributed manner.


Some of the properties of hypercubes may be realized more efficiently by other networks individually [8]. For example, De Bruijn graphs have shorter logarithmic internode distance and richer connectivity, tree networks are most suitable for divide-and-conquer type algorithms, and loops and linear arrays can be embedded in grid structures. The hypercube, however, has many desirable properties as mentioned above. Most importantly, several versions of hypercube architectures are now commercially available [27], which makes experimental implementation of our algorithms possible.


Loops and multi-dimensional grid networks are chosen since they are widely used for a variety of applications [2, 11, 12, 28]. We propose fault-tolerant mapping strategies and corresponding distributed reconfiguration algorithms for the loop networks, and then apply them to the multi-dimensional grid networks.


2. Background


In [16], topological properties of hypercubes were examined. A hypercube of degree d is an undirected graph with $2^d$ nodes labeled 0 through $2^d - 1$. There is an edge between a given pair of nodes if and only if the binary representations of their labels differ by exactly one bit. In this paper, we will represent a hypercube of degree d by drawing $2^{d-3}$ independent 3-cubes. We assume each 3-cube always has the most significant 3 bits (MSBs) labeling as illustrated in Figure 2.1, while the less significant bits appear on top of each 3-cube in a left to right order, i.e., the least significant bit (LSB) is in the rightmost position.
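
The adjacency rule can be made concrete with a small sketch. The function names below are illustrative assumptions and do not come from the paper; node labels are treated as plain integers.

```python
def are_adjacent(u: int, v: int) -> bool:
    """Return True iff labels u and v differ in exactly one bit."""
    diff = u ^ v
    # diff has exactly one bit set iff it is a nonzero power of two.
    return diff != 0 and (diff & (diff - 1)) == 0


def neighbors(u: int, d: int) -> list[int]:
    """Return the d neighbors of node u in a hypercube of degree d."""
    return [u ^ (1 << k) for k in range(d)]
```

For a hypercube of degree 4, `neighbors(0b0101, 4)` lists the four labels obtained by flipping one bit of 0101, and each of them satisfies `are_adjacent`.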


To distinguish the basic network from the host, we use the term vertex for a processor of the basic network, and node for a processor of the host. For simplicity, we use $M_a$ as the image of a loop vertex a in the hypercube, i.e., $M_a$ can be viewed as a d-bit binary code. XOR is the exclusive-or operator and can be applied to two or more operands which are binary codes.


It is easy to see that a hypercube of degree d has subgraphs which are loops of length 4, 6, ..., $2^d$ respectively. If the length of the loop equals $2^d$, the nodes of the hypercube are fully utilized. However, if the length n is less than $2^d$, then $(2^d - n)$ nodes are left. We can use these remaining nodes as spares to adjust the mapping in case of faults so that the new mapping is still a loop of length n.

Another attractive property of hypercubes is that an n-dimensional grid of size $2^{m_1} \times 2^{m_2} \times \cdots \times 2^{m_n}$ can be perfectly mapped in a d-cube, where d is the sum of the $m_i$'s [16]. All the $2^d$ nodes of this d-cube are occupied. However, if we use a larger hypercube of degree d+1 as the host graph, then $2^d$ nodes remain as spares, which can be used to adjust the mapping in the presence of faults so that the topology of the multi-dimensional grid can still be maintained.


3. Failure Model 


The failure model used in this paper follows the model given in [23]. Each processor of the basic network has a unique state. The system is in an operational state if and only if all the distinct states exist. A fault causes a state to go missing. The system should be able to reconfigure itself in a distributed manner, via the local operations of fault-free processors, until the missing state is recovered.


The usages and definitions are as follows. 


• A node $M_i$ is in the active state if it is one of the image nodes of the basic graph and is fault-free. It is in the faulty state if it is faulty. If none of the above cases applies, then it is in the spare state. Figure 3.1 illustrates the state transition diagram of each node. We represent the spare state by 0, the faulty state by -1, and each active state by a unique positive integer.

• $S(M_i)$ is the state of node $M_i$. Let M denote a set of nodes, $(M_1, M_2, \ldots, M_k)$. The state of M is $S(M) = (S(M_1), S(M_2), \ldots, S(M_k))$, $k \ge 1$.

• $E(S(M_i))$ represents the fault that state $S(M_i)$ is missing.

• A system state is valid if and only if all the n active states are present such that the active nodes labeled by the states constitute a loop or multi-dimensional grid. Otherwise it is invalid.

• $(M_i, M_j)$ is a recovery pair if $M_i$ is responsible for the fault of $M_j$, and we say $M_i$ is the parent and $M_j$ is the child.


We list other assumptions about the failure model. 


• Reliable fault diagnosis mechanisms are assumed available [29, 30, 31]. To ensure correctness, each node $M_i$ periodically tests itself. If it is faulty, the state of $M_i$ becomes -1.

• A node $M_i$ can detect the states of all its neighbors. It updates the state information in its local copy, and recovers any faulty child. The recovery actions are assumed fault-free.

• The links are fault-free.

• Only one error can be present in the system.


4. A Reconfiguration Algorithm for Embedded Loops in Hypercubes


In this section, we describe the system 
specification, propose a distributed reconfiguration stra- 
tegy and three initial mapping schemes. More examples 
and detailed proofs of this paper are in [32]. 


4.1. State Assignment and Detection 


Let $L_n$ be a loop of length n mapped into a d-cube, where n is even and $n < 2^d$. Each vertex x of $L_n$ has a unique label representing the state of x, and each node of the d-cube, $C^d$, has a unique ID represented as a Gray code [2]. Let $C^d[L_n]$ be the resulting mapping. Every node of $C^d[L_n]$ can be viewed as having a unique d-bit ID code and up to n+2 possible states: n active states denoted by the distinct labels of $L_n$, a spare state denoted by 0, and a faulty state denoted by -1. Figure 4.1 illustrates the basic graph $L_{10}$ and $C^4[L_{10}]$ together with a labeling.

The state of $C^d[L_n]$ can be expressed as a $2^d$-tuple $S(M_1, M_2, \ldots, M_{2^d}) = (S(M_1), S(M_2), \ldots, S(M_{2^d}))$. The system state may become invalid in case of faults. For example, for the system shown in Figure 4.1(b), (1, 0, 0, 5, 2, 0, 3, 4, 10, 0, 9, 6, 0, 0, 8, 7) is a valid system state, while (1, 0, 0, 5, 2, 0, 3, 4, 10, 0, 9, 6, 0, 0, -1, 7) is invalid since state 8 is missing.


The node in state i is responsible for recovering the missing state i+1, i.e., the fault E(i+1), where the addition is modulo-n. A parent $M_i$ periodically checks the state of its child $M_j$, and saves and updates the work environment of node $M_j$. If $M_j$ is faulty, $M_i$ will recognize this fault during its next detection, and provide the saved work environment of $M_j$ to reconfigure the system to a valid state.


The amount of overhead associated with the saving 
and updating of the work environment depends on the 
frequency of state updates. This involves the problem of 
optimal placement of checkpoints, which takes as fac- 
tors the state vector of each node, the reliability of 
each node, the degree of urgency of fault recovery etc. 


4.2. Strategy and Algorithm 


In this section we propose a distributed, homogeneous strategy with which each node can recover the missing state of its child.


Lemma 4.1 Suppose vertices i, j, and k are three consecutive vertices of a loop, and $(M_i, M_j)$, $(M_j, M_k)$ are recovery pairs. Then there exists a node $M_l$ such that $M_i$, $M_j$, $M_k$ and $M_l$ constitute the four corners of a 2-D plane of the hypercube [16].


According to Lemma 4.1, we get $XOR(M_i, M_j, M_k) = M_l$. That means if each active node $M_i$ has the backup work environment of its child, say $M_j$, the code of $M_j$, and the code of $M_j$'s child, say $M_k$, then it is possible to reconfigure the system under any possible single fault $E(S(M_j))$. When the state of $M_j$ has been lost, node $M_i$ can detect this fact and compute $XOR(M_i, M_j, M_k)$ to get the code of $M_l$, which is adjacent to it. It will then reset the node $M_l$ by sending the missing state $S(M_j)$, the backup work environment, and the code $M_k$. If node $M_l$ is a spare node, the system has reached a consistent valid state so that it can resume operation. However, if $S(M_l) > 0$, then a new fault, $E(S(M_l))$, occurs, which in turn will be detected and recovered by $M_l$'s parent. It seems that the fault will be propagated along the loop and finally be absorbed by a spare node. It is easy to see that this strategy results in a homogeneous and distributed algorithm.
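
A single recovery step of this strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary `states`, the function name `recover_once`, and the 3-cube example loop are all hypothetical, and only one step (no propagation chain, no work-environment transfer) is modeled.

```python
SPARE, FAULTY = 0, -1  # state encoding used in the failure model


def recover_once(states: dict[int, int], parent: int, child: int,
                 grandchild: int, missing_state: int) -> tuple[int, int]:
    """Perform one recovery step for a faulty child.

    states maps each node code to its state (0 spare, -1 faulty, >0 active).
    Returns (target code, state displaced from the target). A displaced
    state of 0 means a spare absorbed the fault; a positive value means a
    propagated fault for that state now has to be recovered.
    """
    target = parent ^ child ^ grandchild  # fourth corner of the 2-D plane
    displaced = states[target]
    if displaced == FAULTY:
        raise RuntimeError("recovery is stuck: the target node is faulty")
    states[target] = missing_state
    return target, displaced


# Hypothetical example: a loop of length 6 in a 3-cube, states 1..6.
loop = [0b000, 0b100, 0b110, 0b111, 0b011, 0b001]
states = {code: SPARE for code in range(8)}
for state, code in enumerate(loop, start=1):
    states[code] = state
states[0b100] = FAULTY  # the node holding state 2 fails
target, displaced = recover_once(states, 0b000, 0b100, 0b110, 2)
```

In this example the target is node 010, which was a spare, so the fault is absorbed in one step. With a different initial mapping the target may already be active, in which case the returned state identifies the propagated fault.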


However, the strategy does not work for all initial mappings. Consider the counterexample shown in Figure 4.2. A loop of length 12 is mapped into a hypercube of degree 4. Suppose E(2) occurs. New faults are generated in the sequence E(4), E(6), E(8), E(10), E(12) to recover the previous fault. The system then gets stuck when it tries to recover state 12 onto a faulty node. Hence the correctness of the proposed strategy depends on the initial mapping.


We exclude the case that the length of the loop 
equals 4, since no feasible spare nodes are available and 
any kinds of initial mappings will only lead to a face of 
the hypercube. Nevertheless, we still can solve this 
problem by moving two adjacent nodes, one faulty node 


and one active node, at the same time. In other words, 
more than local information is required. 


4.3. Initial Mappings 


To design the initial mappings, two alternative strategies, I and II, may be considered. Strategy I is a general scheme which can map any loop of even length l into a d-cube, where $4 < l \le 2^d - 2$. It ensures the correctness of the reconfiguration process under any single fault, though the overhead may be large. On the other hand, Strategy II minimizes the number of steps, in terms of state changes, needed to reconfigure a faulty system back to a valid state. It is normally employed when our major concern is to reduce the communication and state switching overhead.


We propose the general scheme in Section 4.3.1 
and give two examples for schemes of strategy II in Sec- 
tion 4.3.2. 


4.3.1. Mapping I 


Definition The parity of a node $M_i$ whose code equals $x_1 x_2 \cdots x_d$ is even if $XOR(x_1, x_2, \ldots, x_d) = 0$; it is odd if $XOR(x_1, x_2, \ldots, x_d) = 1$, where $x_i = 0$ or 1, for $1 \le i \le d$.
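
The parity definition amounts to counting the 1 bits of a code modulo 2. A minimal sketch, with the function name chosen here for illustration:

```python
def parity(code: int) -> int:
    """Return 0 if the code has an even number of 1 bits, 1 if odd."""
    return bin(code).count("1") % 2
```

Because adjacent hypercube nodes differ in exactly one bit, they always have opposite parity, which is why a pair of adjacent spare nodes contains one node of each parity.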


Definition Suppose a loop of length n has been mapped into a hypercube of degree d. Let $M_0, M_1, \ldots, M_{n-1}$ be the corresponding image nodes. A sequence $b_1, b_2, \ldots, b_n$ is a transition sequence if $M_{i-1}$ differs from $M_i$ by the $b_i$-th MSB, for $1 \le b_i \le d$, $1 \le i \le n-1$, and $M_{n-1}$ differs from $M_0$ by the $b_n$-th MSB. We say $b_i$ is the transition from node $M_{i-1}$ to $M_i$.


Example Consider the initial configuration shown in Figure 4.1(b), and let $M_0 = 1000$, i.e., the code of the node whose state is 2. Then $M_1 = 1100$ (in state 3), $M_2 = 1110$ (in state 4), ..., $M_9 = 0000$ (in state 1), and the transition sequence from $b_1$ to $b_{10}$ equals 2, 3, 1, 4, 1, 3, 1, 2, 4, 1.

It is easily seen that if $b_1, b_2, \ldots, b_n$ is a transition sequence for some mapping, n > 2, then $b_i \ne b_{i+1}$. And if $a = b_i$, for $1 \le a \le d$ and $1 \le i \le n$, then a must appear an even number of times.
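
The example sequence can be replayed mechanically. The sketch below, with illustrative function names that are not from the paper, applies a transition sequence to a starting code (transition b flips the b-th MSB) and checks that the result is a closed loop of distinct nodes.

```python
def apply_transitions(start: int, transitions: list[int], d: int) -> list[int]:
    """Return the codes visited from start; the last entry is the code
    reached after the final transition."""
    codes = [start]
    for b in transitions:
        # The b-th MSB of a d-bit code is bit position d - b from the LSB.
        codes.append(codes[-1] ^ (1 << (d - b)))
    return codes


def is_loop(start: int, transitions: list[int], d: int) -> bool:
    """True iff the sequence returns to start and visits distinct nodes."""
    codes = apply_transitions(start, transitions, d)
    return codes[-1] == start and len(set(codes[:-1])) == len(transitions)


codes = apply_transitions(0b1000, [2, 3, 1, 4, 1, 3, 1, 2, 4, 1], 4)
print([format(c, "04b") for c in codes[:-1]])
```

For the sequence 2, 3, 1, 4, 1, 3, 1, 2, 4, 1 starting at 1000, the visited codes begin 1000, 1100, 1110 and end at 0000, in agreement with the example, and every bit position is flipped an even number of times so that the walk closes.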


Definition Let the transition graph, as shown in Figure 4.3, denote that $(M_i, M_j)$ is a recovery pair, and $M_i$ differs from $M_j$ by the k-th MSB.


Definition Let A(i) denote the maximum length of a loop which can be mapped into a hypercube of degree i - 1 so as to tolerate any single fault, where i > 2.

We have $A(i) = 2^{i-1} - 2$, and $A(i+1) - A(i) = 2^{i-1}$.

The proposed initial mapping reserves two consecu- 
tive feasible spare nodes, i.e., one is with odd parity 
and the other with even parity. This can be done by 
separating a 3-cube, and assigning the states of six con- 
secutive loop vertices and two 0 states, to corner nodes, 
which is shown in Figure 4.4 (a) and (b). M, and M, are 
end nodes. 

We can generate a loop of length six by adding the dashed arrow and transition 1 from M, to M, (Figure 4.4(a)).


In general, when 2'<n<2'*'_2, for i>3, and n is 
even, we can construct a new loop of length n from a 
loop of length n—2 by deleting the dashed line which 
labeled n—2, including two new nodes into the loop, 
(Mn—2)/2 Ma _)jo), and adding one arrow between them. 
The transition graph is shown in Figure 4.5, where 6, 
depends on some },, 1<?<(n—2)/2. 


$+1 


Definition The ith interval contains codes and transi- 
tion sequence from May tO Macy ( or from Maga to 
Magi ). The intersection between (i-1)th and (7)th 
interval is not empty, but includes M,,). Hence the 
transition sequence, byrjysir bayyaor +++ b4ag41y belongs to 
the ith interval, where 6,()4, = ¢+1, for i>2. 


We propose the rules to decide the transitions, 6, : 


b= 3 
b, = 
bag = ttl, +23 


baits = OAs 2578 a +1 

base; = Oj~2, a 42< 5 < A(itl) — A(t) 

In summary, we copy the reverse transitions of the 
(i-1)th interval to that of the first half of the ith 
interval, and copy the transitions of the (:—1)th interval 
exclude the first transition, b4(),, to that of the 
second half of the :th interval. 


Example Let d = 5, $M_0 = 00000$. Applying the proposed rules, we can obtain the transition graph shown in Figure 4.6. $M_0 \to M_1 \to \cdots \to M_5 \to M'_5 \to M'_4 \to \cdots \to M'_0 \to M_0$ is a loop of length 12, and at least 4 bits are required, while $M_0 \to M_1 \to \cdots \to M_7 \to M'_7 \to M'_6 \to \cdots \to M'_0 \to M_0$ is a loop of length 16, and at least 5 bits are required.


We show that the proposed initial mapping indeed 
forms a loop in Theorem 4.1, and outline its single-fault 
reconfigurability in Theorem 4.2. 


Lemma 4.2 for any possible i and j, 7+j, we have 
(1) M, # M;; 
(2) M, #M;, M; ¥ M,. 


Theorem 4.1 $M_i$, $M'_i$ form a loop of length 2n + 2, for $0 \le i \le n$.


Proof Since $M_i \to M_{i+1}$ and $M'_{i+1} \to M'_i$ for $0 \le i \le n-1$, and $M_n \to M'_n$, $M'_0 \to M_0$, and for all distinct i, j, $M_i \ne M_j$, $M'_i \ne M'_j$, no overlapping is possible. Hence $M_0 \to M_1 \to \cdots \to M_n \to M'_n \to \cdots \to M'_0 \to M_0$ is a loop of length 2n + 2.


Lemma 4.3 If 7 is even, ee M41) Mz4o) = 
is odd, XOR(M;, M;4,, Mj4o) = M,49, for ¢ > 1. 
Lemma 4.4 If i is even, XOR(M; ; Mya; Miao) = M,_,» Ii ¢ 
is odd, XOR(M,, Mya Mj45) = Mj) os i>. 


Lemma 4.5 Let M, = XOR(M,, M,, M,), and 
My = XOR(My, M,, M, ). We have M,#M,, M, %M, 
M, #M,, M4 ~M; for i>0, and iA. In other words: 
M, and My, are spare nodes. 


Theorem 4.2 The system with the proposed initial 
mapping and algorithm can tolerate any single fault. 


4.3.2. Mapping II 


This section presents two initial mapping schemes 
which need only one step to reconfigure to a valid state 
in case of a fault. 


Lemma 4.6 Let code M, =; to - xq = 000 a O, 


f—1 fl i#—1 , ?—1 
And let code M,=2' 2) es ny= a, oy wy %, 


for 1<i<2d-—1, where z,=0 orl, andi<j<d. & is 


the complement of x. Then nodes M,, M,, ... , Moy 
form a loop of length 2d. 

Lemma 4.7 Let code M, = 9) Yo °°* Yq =100 --~ 001, 
where bit y; =0 or 1 and 1 =o <d. Codes 


i—l1 ofl t—1 


Mp=Yiva °° W@=¥2.% °° Md Se for 
1<i< 2d ~—1, where y-=0 or 1 and1<j<d. ¥ is the 
complement of y. We have XOR(M,_,, M,, M;4,) = M;, for 
O0<:<2d-—1, where addition and subtraction are 
modulo-2d. 


Theorem 4.3 For a d-cube, we can map a loop of 
length 2d, such that the system with the resulting map- 
ping and the proposed strategy can reconfigure any sin- 
gle fault within 1 step. 


Proof According to Lemma 4.6, the $M_i$'s form a loop of length 2d. We can include arrows to define the parent-child relations:
$M_0 \to M_1 \to M_2 \to \cdots$
So we know $M_{i-1}$ is $M_i$'s parent, and $M_i$ is $M_{i+1}$'s parent. Since $XOR(M_{i-1}, M_i, M_{i+1}) = M'_i$ (Lemma 4.7) and the $M'_i$'s are spare nodes, $M'_i$ can be used immediately when node $M_{i-1}$ detects the fault $E(S(M_i))$. □


As a matter of fact, we can start from any node instead of the nodes with only 1 or 2 blocks of 0's and 1's in their codes, e.g., 0000 ... 0000, 0011 ... 11, etc.


Theorem 4.4 For any node $N_0$ in a hypercube of degree d, let $N_0, N_1, N_2, \ldots, N_{2d-1}, N_0$ be generated by the transition sequence 1, 2, ..., d, 1, 2, ..., d. A loop of length 2d with image nodes $N_0, N_1, N_2, \ldots, N_{2d-1}$ can tolerate any single fault within 1 step, for any possible starting node $N_0$.
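
The one-step property can be checked by brute force for small d. The sketch below assumes the transition sequence is 1, 2, ..., d, 1, 2, ..., d (flipping the b-th MSB at step b) and that recovery uses the XOR rule of Section 4.2; the function names are illustrative and not taken from the paper.

```python
def mapping2_loop(start: int, d: int) -> list[int]:
    """Codes N_0 .. N_{2d-1} generated by transitions 1..d, 1..d."""
    transitions = list(range(1, d + 1)) * 2
    codes = [start]
    for b in transitions[:-1]:  # the last transition closes the loop
        codes.append(codes[-1] ^ (1 << (d - b)))
    return codes


def one_step_recoverable(codes: list[int]) -> bool:
    """True iff, for every loop node, XOR(parent, node, child) is a spare."""
    active = set(codes)
    n = len(codes)
    for i in range(n):
        parent, child = codes[i - 1], codes[(i + 1) % n]
        if (parent ^ codes[i] ^ child) in active:
            return False
    return True
```

For d = 3, 4, 5 the check succeeds from every starting node. For d = 2 it fails, since a loop of length 4 occupies the whole 2-cube and leaves no spare, matching the exclusion of length-4 loops in Section 4.2.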


4.3.3. Performance Analysis 


Let c(M,) be the number of reconfiguration steps 
needed when node M, becomes faulty. We can analyze 
the performance of Mapping I by computing the aver- 
age number of reconfiguration steps as follows. 

Let $M_0, M_1, \ldots, M_{n-1}, M'_{n-1}, \ldots, M'_1, M'_0$ be the mapping sequence of a loop of length 2n. Then, the total number of reconfiguration steps summing from $M_0$ to $M_{n-1}$ is:


Case I: n-1 is odd 
n—l 


du e(M, ) =| e(M, ) + (Ms ) + reese + ¢(M,_,) | 
+ (Mya) + e(Mo) + [ o(My) + --- + (Mya) 
=(14+24..... — )42424(— — toe +1) 
; 2 2 
n —2n 9 
4= O(n’ ) 
Case II: n-1l is even 
2, e(M; ) = | e(M )+ c(M,) + .....+ ¢(M,.)] 
+ ¢(M,)+[¢(M,)+ ¢(M,) + ..... + c({M,_, )] 
ee a ee jot — a) 
2 2 2 
n—1 n—3 n—1 
+ ( )t oe + ( +1)] 
2, Z 
(n—1) 2 
= +2= O(n’ ) 


Similarly, the result for summing from $M'_0$ to $M'_{n-1}$ is the same. Thus, we have O(n) as the average number of reconfiguration steps for Mapping I.


The one-step reconfigurable property seems to make Mapping II a better choice than Mapping I in terms of the number of reconfiguration steps, especially when n is $O(2^d)$. However, Mapping II allows only loops of limited length.


5. A Reconfiguration Algorithm for Embedded Multi-Dimensional Grids in Hypercubes

This section discusses the reconfiguration algorithm and initial mapping strategy for mapping a multi-dimensional grid of size $2^{m_1} \times 2^{m_2} \times \cdots \times 2^{m_n}$ into a hypercube of degree d + 1, where $d = \sum_{i=1}^{n} m_i$.

5.1. State Assignment and Detection 


We identify each vertex of a $2^{m_1} \times 2^{m_2} \times \cdots \times 2^{m_n}$ grid A by an n-tuple $A(i_1, i_2, \ldots, i_n)$, where $0 \le i_j \le p_j - 1$ and $p_j = 2^{m_j}$.

Definition The state of grid vertex $A(i_1, i_2, \ldots, i_n)$ is defined as follows.

$$S(A(i_1, i_2, \ldots, i_n)) = i_1 p_2 p_3 \cdots p_n + i_2 p_3 \cdots p_n + \cdots + i_{n-1} p_n + i_n + 1 = \sum_{j=1}^{n} i_j a_j + 1,$$

where $a_n = 1$, and $a_j = \prod_{k=j+1}^{n} p_k$ for $1 \le j < n$. Hence the system can have up to $p_1 \times p_2 \times \cdots \times p_n$ active states, one invalid state -1, and one spare state 0.
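
The state assignment is a row-major index shifted by one. The following sketch evaluates it with Horner's rule; the function name and argument layout are assumptions made for illustration, with `sizes` holding the side lengths $p_1, \ldots, p_n$.

```python
def grid_state(index: tuple[int, ...], sizes: tuple[int, ...]) -> int:
    """State of grid vertex A(i_1, ..., i_n): sum of i_j * a_j, plus 1."""
    if len(index) != len(sizes):
        raise ValueError("index and sizes must have the same length")
    state = 0
    for i, p in zip(index, sizes):
        if not 0 <= i < p:
            raise ValueError("index component out of range")
        state = state * p + i  # Horner evaluation of the weighted sum
    return state + 1
```

For instance, in a 2 × 4 × 8 grid the vertex with index (0, 1, 1) receives state 0·32 + 1·8 + 1 + 1 = 10, and the states range from 1 to 64.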

Definition A state missing due to the fault of a node changing its state to -1 is called an original fault. If the fault is induced by a state propagation, it is called a propagated fault. These two faults are distinguishable since the original fault will change the state to -1, while the propagated fault will switch the node to a state which is greater than 0.


Definition We define the new parent-child relationship as follows. Node $A(i_1, \ldots, i_{j-1}, i_j, i_{j+1}, \ldots, i_n)$ is responsible for detecting and recovering the fault of node $A(i_1, \ldots, i_{j-1}, i_j + 1, i_{j+1}, \ldots, i_n)$, where $1 \le j \le n$ and the addition is modulo-$p_j$.


5.2. An Initial Mapping Strategy for Multi- 
dimensional Grids 


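Theorem 5.1 Let $G^m_i$ denote the ith code in an m-bit Gray code sequence, where $G^m_0 = 00 \cdots 0$ and $0 \le i \le 2^m - 1$. Let $F^m_i$ denote the code of the ith loop state when we map a loop of length $2^{m-1}$ into a hypercube of degree m. For any vertex $W = A(i_1, i_2, \ldots, i_n)$, let

$$M_W = G^{m_1}_{i_1} \,\|\, G^{m_2}_{i_2} \,\|\, \cdots \,\|\, G^{m_{n-1}}_{i_{n-1}} \,\|\, F^{m_n+1}_{i_n};$$

then these image nodes still maintain a multi-dimensional grid of size $2^{m_1} \times 2^{m_2} \times \cdots \times 2^{m_n}$.

Proof


(1) Two nodes of the hypercube are adjacent iff they differ by one bit.

(2) Two vertices $X = A(i_1, i_2, \ldots, i_n)$ and $Y = A(j_1, j_2, \ldots, j_n)$ of the grid are connected iff there exists exactly one index a such that $i_a = j_a \pm 1$ and all the other entries are the same, i.e., $i_l = j_l$ for $1 \le l \le n$ and $l \ne a$.

Case (a): If $a \ne n$:

$$M_X = G^{m_1}_{i_1} \,\|\, \cdots \,\|\, G^{m_a}_{i_a} \,\|\, \cdots \,\|\, G^{m_{n-1}}_{i_{n-1}} \,\|\, F^{m_n+1}_{i_n}$$

$$M_Y = G^{m_1}_{i_1} \,\|\, \cdots \,\|\, G^{m_a}_{i_a \pm 1} \,\|\, \cdots \,\|\, G^{m_{n-1}}_{i_{n-1}} \,\|\, F^{m_n+1}_{i_n}$$

Since $G^{m_a}_{i_a \pm 1}$ differs from $G^{m_a}_{i_a}$ by one bit, $M_X$ differs from $M_Y$ by one bit. (Note that $M_X$ and $M_Y$ are in the same dimension a.)

Case (b): If a = n:

$$M_X = G^{m_1}_{i_1} \,\|\, G^{m_2}_{i_2} \,\|\, \cdots \,\|\, G^{m_{n-1}}_{i_{n-1}} \,\|\, F^{m_n+1}_{i_n}$$

$$M_Y = G^{m_1}_{i_1} \,\|\, G^{m_2}_{i_2} \,\|\, \cdots \,\|\, G^{m_{n-1}}_{i_{n-1}} \,\|\, F^{m_n+1}_{i_n + 1} \quad (\text{or } F^{m_n+1}_{i_n - 1})$$

According to the proposed loop construction, we know $F^{m_n+1}_{i_n}$ differs from $F^{m_n+1}_{i_n \pm 1}$ by one bit, so $M_X$ differs from $M_Y$ by one bit.

This completes the proof.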


5.3. Recovery Strategy 


Suppose node $y = A(i_1, i_2, \ldots, i_j, \ldots, i_n)$ generates a fault, either original or propagated. It will be detected by $x = A(i_1, i_2, \ldots, i_j - 1, \ldots, i_n)$, where $1 \le j \le n$. The recovery procedure for the parent x is as follows.

Reconfiguration Algorithm


Case (1): If j = n: x performs loop reconfiguration along its nth dimension. In other words, either a spare node is immediately available or a propagated fault is generated, which will propagate along the nth dimension until it is absorbed.


Case (2): If $j \ne n$ and the error is original: x changes its state S(x) to -1, thus generating an original fault which can be detected by $A(i_1, i_2, \ldots, i_j - 2, \ldots, i_n)$ and $A(i_1, i_2, \ldots, i_j - 1, \ldots, i_k - 1, \ldots, i_n)$, for $1 \le k \le n$ and j < k.


Case (3): If $j \ne n$ and the error is a propagated fault: x ignores this kind of fault to ensure the parallel propagations.

For each $X = A(i_1, i_2, \ldots, i_{n-1}, j_n)$, we have a loop of length $2^{m_n}$ along its nth dimension, which has $2^{m_n+1}$ nodes. In other words, we can fix $i_1, i_2, \ldots, i_{n-1}$ and vary $j_n$ to get a Gray code sequence of length $2^{m_n}$. Since we can vary $i_1, i_2, \ldots, i_{n-1}$ and do the same thing, we have a total of $\prod_{l=1}^{n-1} 2^{m_l}$ such "parallel" loops. Hence we have Lemma 5.1 and Theorem 5.2.


Lemma 5.1 If $Y = A(j_1, j_2, \ldots, j_n)$ is faulty, then $2^{m_1} \times 2^{m_2} \times \cdots \times 2^{m_{n-1}}$ original faults, including E(S(Y)), will be generated by performing the proposed reconfiguration algorithm.


Theorem 5.2 The grid system with the proposed algorithm and initial mapping can tolerate any single fault within $O(\prod_{l=1}^{n-1} 2^{m_l})$ steps.


Example Figure 5.1 is an example of the above strategy. Suppose node 0010010 (in state 10) becomes faulty, say event (0); a sequence of events may then occur.


In this example, only original faults are generated to simplify the description, while faults which cannot be recovered immediately should produce further propagated faults along the nth dimension, which will eventually be absorbed by a spare node.


6. Conclusion 


In this paper, we have presented distributed 
reconfiguration algorithms for loops and multi- 
dimensional grids embedded in hypercubes. We have 
also proposed initial mapping algorithms to map loops 
and grids on hypercubes to facilitate reconfiguration. 


We are currently investigating one-step reconfigurable schemes which are more efficient, in terms of the degree of hypercubes, than Mapping II. We are also investigating some engineering issues such as the effect of the reliability of each node on our mapping schemes.

Figure 2.1 The fixed codes for a 3-cube

Figure 3.1 The state transition diagram of node $M_i$
Transition 1 --- Node $M_i$ becomes faulty
Transition 2 --- Faulty node is repaired
Transition 3 --- Spare node is activated
Transition 4 --- Spare node becomes faulty

Figure 4.1 An example of an initial configuration for n = 10, d = 4
(a) Basic graph for $L_{10}$
(b) The initial configuration of $C^4[L_{10}]$

Figure 4.2 A counterexample of the proposed strategy (4 bits)

Figure 4.3 The transition graph from $M_i$ to $M_j$

Figure 4.5 The attached transition graph

Figure 4.6 Transition graph for d = 5, $M_0$ = 00000

Figure 5.1 An example of how the recovery protocol works when the node in state 10 of a 2 × 4 × 8 grid becomes faulty

Dynamic Computational Geometry on Meshes and Hypercubes


Laurence Boxer * 

Department of Computer and 
Information Sciences 

Abstract

Parallel algorithms are given for determining geometric prop- 
erties of systems of moving objects. The properties investigated 
include nearest (farthest) neighbor, closest (farthest) pair, col- 
lision, convex hull, diameter, and containment. Several of these 
properties are investigated from both the dynamic and steady- 
state points of view. Efficient, and often optimal, implementa- 
tions of these algorithms are given for the mesh and hypercube. 


1 Introduction 


Suppose n point-objects are moving in Euclidean space such that for 
each object, every coordinate of its motion is a polynomial of time. 
For such a system, we present parallel algorithms described in terms 
of abstract data movement operations to solve a variety of problems 
involving proximity, collision, containment, and convexity. We give 
solutions to these problems for the dynamic situation and for the 
steady-state situation (as time approaches infinity). We give imple- 
mentations of these algorithms that are asymptotically optimal on the 
mesh and are efficient on the hypercube. 

This paper was motivated by serial algorithms for dynamic com- 
putational geometry given in [Atal85]. 


2 Preliminaries 


The notations O, Θ, and Ω will be used in this paper to mean, intuitively, “order at most,” “order exactly,” and “order at least,” respectively (see, e.g., [Mill88a]).

We will use the terms processor and processing element (PE) in- 
terchangeably. 


2.1 Mesh-Connected Computer 


A mesh of size n has n PEs arranged as an $n^{1/2} \times n^{1/2}$ lattice. The PEs of a mesh of size n are frequently numbered from 0 to n - 1 so as to impose an order upon them. In this paper, we assume that the PEs are indexed via proximity order [Mill88a] (see Figure 1). The properties of proximity order that are useful to us are the following.


1. In a mesh of size n, if $0 \le i < n - 1$ then $PE_i$ and $PE_{i+1}$ are neighboring PEs.


2. A mesh may be recursively subdivided into sub-meshes such that each sub-mesh contains consecutively indexed PEs.


* Research partially supported by a grant from the Niagara University Research Council.

† Research partially supported by National Science Foundation grant number DCR-8608640.


Russ Miller †
Department of Computer Science, 226 Bell Hall
State University of New York
Buffalo, New York 14260, USA

(Address of the first author: Niagara University, NY 14109, USA)

Let $S$ be a nonempty subset of the processors of a mesh. We say $S$ is an interval or a string of the mesh if and only if there are integers $i_0$ and $i_1$, $0 \le i_0 \le i_1 < n$, such that $S = \{PE_i \mid i_0 \le i \le i_1\}$. We will prefer the term string in this paper, as we will often use the term interval to refer to a subset of the real line.


Figure 1: Proximity order for a mesh of size 16. 


2.2 Hypercube Computer 


A hypercube of size n, where n is a nonnegative integral power of 2, has n PEs or nodes indexed by the integers {0, 1, ..., n - 1}. If we view each integer in the index range as a $(\log_2 n)$-bit string, two PEs are connected by a bidirectional communication link if and only if their indices differ in exactly one bit.

It is useful to re-label the PEs of a hypercube so that consecutively 
labeled PEs are adjacent, and so that we may split the hypercube into 
subcubes such that the subcubes consist of consecutively labeled PEs. 
A commonly used method of ordering the PEs of a hypercube with 
these properties is the binary reflected Gray code [Rein77]. Through- 
out this paper, processors in a hypercube will be labeled not by node 
number, but according to a binary reflected Gray code ordering. 

A string of processors in a hypercube will be a nonempty set of consecutive processors according to Gray code order, i.e., a set $S$ of PEs for which there are integers $i_0$ and $i_1$ such that $0 \le i_0 \le i_1 < n$ and $S = \{PE_j \mid i_0 \le j \le i_1\}$ according to a binary reflected Gray code ordering of the processors.
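
The binary reflected Gray code has a well-known closed form, which the following sketch uses to show why consecutively labeled PEs are adjacent. The function name is illustrative; the paper itself only cites [Rein77] for the ordering.

```python
def gray(i: int) -> int:
    """Return the i-th codeword of the binary reflected Gray code."""
    return i ^ (i >> 1)


# Node numbers visited in Gray code order for a hypercube of size 16.
order = [gray(i) for i in range(16)]
```

Consecutive entries of `order` differ in exactly one bit, so PEs that are consecutive in this labeling are hypercube neighbors. The first half of the sequence keeps the top bit at 0 and the second half at 1, so each half is a subcube of consecutively labeled PEs.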


2.3. Pieces and the function λ


Input to problems in this paper consists of descriptions of real-valued, or more generally, Euclidean vector-valued functions $f_0(t), f_1(t), \ldots, f_{n-1}(t)$ defined on the interval $[0, \infty)$. We assume that at the start of a problem, no processor contains a description of more than one of the functions $f_0, \ldots, f_{n-1}$. For many problems, these functions describe the motion of point-objects $P_0, \ldots, P_{n-1}$, respectively, in Euclidean d-dimensional space. If every component of every function $f_i$ is a polynomial of degree no greater than k, then the collective movement of the points is referred to as k-motion. For convenience, we assume that no pair of the points have the same initial position. That is, $f_i(0) \ne f_j(0)$ for $i \ne j$, $0 \le i, j < n$.

Given a set of real-valued functions $F = \{f_0, \ldots, f_{n-1}\}$
defined on $[0,\infty)$, it is often useful to describe the minimum
function

$h(t) = \min\{f_0(t), \ldots, f_{n-1}(t)\}$.    (1)

Define a piece of the minimum function generated by F to consist of
a description of some $f_i$ and an interval $I \subset [0,\infty)$ such
that $h = f_i$ identically on I and such that h is not identically equal
to $f_i$ over any interval $J \subset [0,\infty)$ such that I is properly
contained in J. A piece of the maximum function generated by F is
defined similarly. If $h_1(t)$ and $h_2(t)$ are real-valued functions
defined on $[0,\infty)$ whose pieces are generated by a family of
functions F, then a piece of $h_1 - h_2$ generated by differences of
members of F consists of a description of a function g and an interval
$I \subset [0,\infty)$ such that

1. there exist $f_1, f_2 \in F$ such that $g = f_1 - f_2$ identically
   on $[0,\infty)$,
2. $h_1 - h_2 = g$ identically on I, and
3. $h_1 - h_2$ is not identically equal to g on any interval
   $J \subset [0,\infty)$ such that I is a proper subset of J.
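To make the notion of a piece concrete, the following sequential sketch lists the pieces of the minimum function for linear functions $f_i(t) = a_i + b_i t$. It is a brute-force illustration under that simplifying assumption, not the parallel algorithm developed later; the name `min_pieces` and the tuple layout are hypothetical.

```python
from itertools import combinations

INF = float("inf")


def min_pieces(lines):
    """lines[i] = (a, b) describes f_i(t) = a + b*t on [0, inf).

    Returns the pieces of h = min f_i as (index, left, right) tuples,
    ordered by interval. Assumes distinct values at sample points.
    """
    # Candidate breakpoints: 0 and every positive pairwise crossing time.
    cuts = {0.0}
    for (a1, b1), (a2, b2) in combinations(lines, 2):
        if b1 != b2:
            t = (a2 - a1) / (b1 - b2)
            if t > 0:
                cuts.add(t)
    cuts = sorted(cuts)

    pieces = []
    for k, left in enumerate(cuts):
        right = cuts[k + 1] if k + 1 < len(cuts) else INF
        # Any interior point identifies the minimizing function on the interval.
        mid = left + 1.0 if right == INF else (left + right) / 2
        idx = min(range(len(lines)), key=lambda i: lines[i][0] + lines[i][1] * mid)
        if pieces and pieces[-1][0] == idx:
            # Same function on adjacent intervals: extend to a maximal interval.
            pieces[-1] = (idx, pieces[-1][1], right)
        else:
            pieces.append((idx, left, right))
    return pieces
```

For the three lines t, 1, and 3 - t, the sketch reports three pieces with intervals [0,1], [1,2], and [2, inf), one per function.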


Many of our algorithms have processor requirements related to the
number of pieces of a minimum function $h(t)$. Let $\lambda(n, s)$ be the
maximum number of pieces of functions
$h(t) = \min\{f_0(t), \ldots, f_{n-1}(t)\}$, where the maximum is taken
over all sets $F = \{f_0, \ldots, f_{n-1}\}$ of continuous real-valued
functions defined on $[0,\infty)$, no pair of which intersects more than
s times.

To describe the behavior of $\lambda(n, s)$, we use the “inverse
Ackermann function” $\alpha(n)$, a description of which is given in
[Hart86]. Note that $\alpha(n)$ is a monotone nondecreasing function that
grows to $\infty$ extremely slowly. For example, [Hart86] shows that

$\alpha(n) \le 4$ for $n \le 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}}$ (the number of 2's in the tower is 65536),

and that if we denote $\log^{(1)} n = \log n$, and more generally,
$\log^{(k+1)} n = \log(\log^{(k)} n)$ for integer $k > 0$, then

$\alpha(n) = O(\log^{(j)} n)$ for all integer $j > 0$.


Theorem 2.1 The following results concerning the function
$\lambda(n, s)$ are known.

1. $\lambda(n, 1) = n$ and $\lambda(n, 2) = 2n - 1$ [Dave65].
2. $\lambda(n, 3) = \Theta(n\,\alpha(n))$ [Hart86].
3. For $s \ge 3$, $\lambda(n, s) = \Omega(n\,\alpha(n))$ (this follows
   from the previous result and the fact that $\lambda$ is an increasing
   function of s), and
4. For $s > 3$, $\lambda(n, s) = O(n[\alpha(n)]^{O(\alpha(n)^{s-3})})$
   [Shar87]. ∎


For all problems considered in this paper that use the function
$\lambda(n, s)$, the parameter s will be a bounded integer. Under such
circumstances, the above implies that for “reasonable” values of n,
$\lambda(n, s)$ is essentially $O(n)$.

The next result gives a property that will be useful for bounding the
number of processors in the algorithm associated with Theorem 3.2 for
constructing the min function.


Lemma 2.2 [Boxe87a] For all positive integers n and s,
$2\lambda(n, s) \le \lambda(2n, s)$. ∎


An interval is nondegenerate if and only if it contains more than
one point. Two intervals have a nondegenerate intersection if and
only if their intersection contains a nondegenerate interval. If p is a
piece of a function f and q is a piece of a function g, we say p and q
have nondegenerate intersection if and only if the interval of p and the
interval of q have nondegenerate intersection. The next two results
give useful bounds on the number of pieces in a “combined” function.


Lemma 2.3 [Boxe87a] Let f(t) and g(t) be real-valued functions
defined for all $t \ge 0$. Let m and n be positive integers. Suppose f(t)
has m pieces and g(t) has n pieces. Then the pieces of f(t) have,
altogether, at most m + n nondegenerate intersections with the pieces
of g(t). ∎


Lemma 2.4 [Boxe87a] Let p and s be positive integers. Let f(t) and
g(t) be real-valued functions defined for all $t \ge 0$. Suppose that for
every piece of both f(t) and g(t), the function of the piece is a
polynomial whose degree is at most s. Assume that the pieces of f(t)
have p nondegenerate intersections with the pieces of g(t). Then the
function min{f(t), g(t)} has no more than p(s + 1) pieces. ∎
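The counting bound of Lemma 2.3 can be checked on small inputs with a sequential sweep. This is only a sketch: each function is represented by its ordered list of piece intervals `(left, right)` covering $[0,\infty)$, and the name `count_nondegenerate` is an assumption made for illustration.

```python
INF = float("inf")


def count_nondegenerate(f_intervals, g_intervals):
    """Count pairs (piece of f, piece of g) whose intervals overlap in
    more than one point. Both lists are ordered left to right."""
    i = j = count = 0
    while i < len(f_intervals) and j < len(g_intervals):
        fl, fr = f_intervals[i]
        gl, gr = g_intervals[j]
        if min(fr, gr) > max(fl, gl):   # overlap has positive length
            count += 1
        # Advance whichever interval ends first (both on a tie).
        if fr <= gr:
            i += 1
        if gr <= fr:
            j += 1
    return count
```

With m = 3 pieces for f and n = 2 pieces for g as below, the sweep finds 4 nondegenerate intersections, within the bound m + n = 5.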


2.4 Data movement operations 


Our algorithms are given in terms of machine independent fundamental
data movement operations. We assume data values are distributed
among the n PEs of a parallel machine so that no PE has more than
$\Theta(1)$ elements. The operations are performed simultaneously within
disjoint strings. (Notice that the entire machine corresponds to a
single string.)

Operations not based on sorting include semigroup computation
and broadcast. Each of these may be implemented on a mesh in
$\Theta(n^{1/2})$ time and on a hypercube in $\Theta(\log n)$ time.

Sort-based operations include sorting, concurrent read, concurrent
write, parallel prefix, grouping, and splitting the data evenly among
the processors. Each of these may be implemented on a mesh in
$\Theta(n^{1/2})$ time, on a hypercube in $\Theta(\log^2 n)$ time, and on
a hypercube in expected $\Theta(\log n)$ time.
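As a rough illustration of why a hypercube needs only a logarithmic number of communication rounds for some of these operations, the sketch below simulates a textbook dimension-by-dimension prefix-sum exchange on one Python list. It is an assumption-laden toy: it uses natural node indices rather than the Gray code ordering and sort-based implementations of [Mill88a], and the name `hypercube_prefix_sums` is hypothetical.

```python
def hypercube_prefix_sums(values):
    """Simulate inclusive prefix sums on a hypercube with len(values) nodes.

    Node i exchanges a running subcube total with neighbor i XOR d for
    d = 1, 2, 4, ...; returns (prefix sums, number of exchange rounds).
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "number of PEs must be a power of 2"
    prefix = list(values)   # prefix[i]: sum of values at indices <= i seen so far
    total = list(values)    # total[i]: sum over the subcube merged so far
    rounds = 0
    d = 1
    while d < n:
        # All nodes read their neighbor's total simultaneously.
        received = [total[i ^ d] for i in range(n)]
        for i in range(n):
            if (i ^ d) < i:              # neighbor's subcube precedes this node
                prefix[i] += received[i]
            total[i] += received[i]
        d <<= 1
        rounds += 1
    return prefix, rounds
```

For 8 nodes the simulation finishes in 3 rounds, one per hypercube dimension.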

See [Mill88a] for descriptions of these operations and details con- 
cerning the implementations and proofs of the algorithms. 


3 Constructing the MIN function 


In this section, we show how a description of the minimum function 
may be constructed efficiently, under relatively mild restrictions, from 
descriptions of a set of real-valued functions. 

The proof of Lemma 3.1 gives an algorithm to construct a description
of the function min{f(t), g(t)}. This algorithm can also be used
to construct a description of the function that results from applying
any of a variety of operations (e.g., max, sum, product) to a pair of
real-valued functions.

If I is a subset of the domain of the function f(t), we denote by
$f|_I$ the restriction of f to I. That is, $f|_I$ is the function whose
domain is I such that $f|_I(t) = f(t)$ for all $t \in I$.


Lemma 3.1 Let $\Phi$ be a family of real-valued functions defined on
$[0,\infty)$. Let f(t) and g(t) be real-valued functions defined on
$[0,\infty)$ by pieces generated by $\Phi$. Let s be a positive integer.
Suppose the function of every piece of f(t) and of every piece of g(t)
has a $\Theta(1)$ storage description and can be evaluated for a given t
in $\Theta(1)$ time by a single processor. Suppose that if I is the
nondegenerate intersection of the intervals of a piece of f(t) and a
piece of g(t), and F and G are members of $\Phi$ such that $f|_I = F$
identically and $g|_I = G$ identically, then there are at most s
solutions to the equation F(t) = G(t), and these solutions may be
calculated by a single processor in $\Theta(1)$ time. Suppose m is a
positive integer such that the total number of pieces of f and g is at
most m. Suppose the pieces of f and the pieces of g are stored in
disjoint strings of a mesh of size m or a hypercube of size m, at most
one piece per PE. Then a description of the function
h(t) = min{f(t), g(t)} can be constructed by the mesh in $O(m^{1/2})$
time; by the hypercube in $\Theta(\log^2 m)$ time; and by the hypercube
in expected $\Theta(\log m)$ time.


Proof: The general algorithm is given in 8 steps. 


1. Since the pieces of f and the pieces of g are in disjoint strings,
   there is a $PE_x$ such that, without loss of generality, pieces of f
   are stored in PEs whose labels are at most x and pieces of g are
   stored in PEs whose labels are greater than x. Broadcast x to
   all PEs.


2. In parallel, each PE containing a piece of f or a piece of g creates
   two sort-records, Left and Right, each containing the following
   information.

   • A tag whose value is “Left” in Left records, “Right” in
     Right records.
   • A source field, whose value is the index of the PE.
   • A description of the piece.
   • An endpoint field, whose value is the left endpoint of the
     interval of the piece for the Left records, the right endpoint
     for the Right records.
   • An “other_piece” field, initially undefined.


3. Sort all of the Left and Right records together with respect to 
the endpoint field. Ties should be broken in favor of a Right 
record. 


4. A string of f is a string whose first PE contains a Left record
   of f and whose last PE contains a Right record of f such that
   no intermediate PE contains a record of f. Records of f are
   recognized by having source $\le x$. A string of g is defined
   analogously. Use a concurrent read so that each Left record of f
   (respectively, g) finds the PE-index of its corresponding Right
   record of f (respectively, g). In parallel, the first PE of each
   string of f broadcasts throughout its string a description of its
   piece of f, which is taken by each record of g in the string as
   its other_piece field. In parallel, the first PE of each string of g
   broadcasts throughout its string a description of its piece of g,
   which is taken by each record of f in the string as its other_piece
   field.


5. A concurrent read is performed based on the source field so
   that each PE gets back copies of the records it started with
   in Step 2, with all components now defined. Thus each PE
   containing a piece of f (respectively, g) now knows the leftmost
   and rightmost pieces of g (respectively, f) with which its
   piece has nondegenerate intersection.


6. We now construct the “subpieces” determined by nondegenerate
   intersections of a piece of f and a piece of g. All PEs act in
   parallel as follows.

   If $PE_i$ contains a piece $p_i$ of f, then $PE_i$ handles the leftmost
   and rightmost nondegenerate intersections of $p_i$ with pieces of
   g by performing the following three steps, once for the record
   with tag “Left” and once for the record with tag “Right.”

   (a) Compute the intersection, I, of the intervals of $p_i$ and the
       tag-record’s other_piece.

   (b) Determine the (at most s) solutions to the equation
       $f|_I(t) = g|_I(t)$.

   (c) The roots found in (b) determine at most s + 1 closed
       nondegenerate subintervals of I with disjoint interiors. For
       each such subinterval J, determine which of $f|_J$ and $g|_J$
       is minimal by comparing $f(t_J)$ and $g(t_J)$, where $t_J$ is any
       interior point of J.

   Let p be a piece of f and let q be a piece of g such that
   p and q have nondegenerate intersection and q is neither the
   leftmost nor the rightmost piece of g whose intersection with
   p is nondegenerate. Then the interval of q is contained in the
   interval of p. Hence, in the PE in which q is stored, the Left
   record and Right record have identical other_piece fields. Such
   processors $PE_j$ perform the following two steps.

   (a) Determine the (at most s) solutions to the equation
       $f|_I(t) = g|_I(t)$, where I is the interval of the piece of g.

   (b) The roots found in (a) determine at most s + 1 closed
       nondegenerate subintervals of I with disjoint interiors. For
       each such subinterval K, determine which of $f|_K$ and $g|_K$
       is minimal by comparing $f(t_K)$ and $g(t_K)$, where $t_K$ is any
       interior point of K.

   Suppose f has u pieces generated by $\Phi$ and g has v pieces
   generated by $\Phi$. By Lemma 2.3, the intervals of the pieces of f and
   the intervals of the pieces of g have at most u + v nondegenerate
   intersections. By Lemma 2.4 there are at most (s + 1)(u + v)
   subpieces determined in this step. Since the pieces of f and the
   pieces of g were stored one per PE, there are $\Theta(1)$ subpieces per
   PE.


7. Sort the subpieces (their intervals have disjoint interiors) from
   left to right so that the subpieces end up in a string such that
   each PE of the string has at least one and at most s + 1 subpieces.


8. At this point, there may be adjacent subpieces with the same
   function F(t). Such pairs should be joined into a single piece.
   I.e., if there are subpieces of the form (F(t), [a, b]) and
   (F(t), [b, c]) or (F(t), [b, $\infty$)), they are joined as
   (F(t), [a, c]) (respectively, (F(t), [a, $\infty$))). Adjacent
   subpieces that have the same function may be joined by creating
   strings of such functions, broadcasting the first and last interval
   to all PEs in the string, letting the first PE in the string create a
   description of the combined subpiece, and using a parallel prefix to
   pack the final set of intervals.


For the mesh: Step 1 requires $O(m^{1/2})$ time. Steps 2 and 6 require
$\Theta(1)$ time. Sorting (Steps 3 and 7) requires $O(m^{1/2})$ time.
Concurrent reads (Steps 4 and 5) require $O(m^{1/2})$ time. The
broadcasting in Step 4 requires $O(m^{1/2})$ time. Step 8 can be
implemented by grouping and parallel prefix operations, both of which
require $O(m^{1/2})$ time. Therefore, the running time of the algorithm
is $O(m^{1/2})$.

For the hypercube: Step 1 requires $\Theta(\log m)$ time. Steps 2 and
6 require $\Theta(1)$ time. Sorting (Steps 3 and 7) and concurrent reads
(Steps 4 and 5) require $\Theta(\log^2 m)$ time, expected
$\Theta(\log m)$ time. The broadcasting in Step 4 takes $\Theta(\log m)$
time. Step 8 can be implemented by grouping and parallel prefix
operations, both of which require $\Theta(\log^2 m)$ time, expected
$\Theta(\log m)$ time. Therefore, the running time of the algorithm is
$\Theta(\log^2 m)$, expected $\Theta(\log m)$. ∎
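The logic of Steps 6 through 8 (intersect intervals, split at roots, compare at an interior point, join adjacent subpieces) can be sketched sequentially for piecewise linear functions, where s = 1. This is a single-processor illustration under those assumptions, not the mesh or hypercube algorithm; the representation `((a, b), left, right)` for the piece a + b*t on [left, right] and the name `min_of_two` are hypothetical.

```python
INF = float("inf")


def min_of_two(f, g):
    """f, g: ordered lists of pieces ((a, b), left, right) covering [0, inf).
    Returns the ordered pieces of h = min{f, g} in the same format."""
    out = []
    i = j = 0
    while i < len(f) and j < len(g):
        (F, fl, fr), (G, gl, gr) = f[i], g[j]
        left, right = max(fl, gl), min(fr, gr)
        if left < right:                       # nondegenerate intersection I
            (a1, b1), (a2, b2) = F, G
            cuts = [left]
            if b1 != b2:                       # at most one root of F(t) = G(t)
                t = (a2 - a1) / (b1 - b2)
                if left < t < right:
                    cuts.append(t)
            cuts.append(right)
            for lo, hi in zip(cuts, cuts[1:]):
                # Compare F and G at an interior point of the subinterval.
                mid = lo + 1.0 if hi == INF else (lo + hi) / 2
                best = F if a1 + b1 * mid <= a2 + b2 * mid else G
                if out and out[-1][0] == best and out[-1][2] == lo:
                    out[-1] = (best, out[-1][1], hi)   # join adjacent subpieces
                else:
                    out.append((best, lo, hi))
        # Advance the piece whose interval ends first (both on a tie).
        if fr <= gr:
            i += 1
        if gr <= fr:
            j += 1
    return out
```

With f(t) = t and g equal to 1 on [0,2] and 3 - t on [2, inf), the result has three pieces: t on [0,1], 1 on [1,2], and 3 - t on [2, inf).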


It should be noted that for some of our algorithms, running times
for the mesh are given in O-notation, while all running times for
the hypercube are in $\Theta$-notation. This is because
$\min\{f_0, \ldots, f_{n-1}\}$ may have fewer than $\lambda(n,k)$ pieces,
in which case it may be possible to use a submesh and obtain
asymptotically faster running times. The same is not true of the
hypercube. Roughly, this is because
$\lambda^{1/2}(n,k) \ne \Theta(n^{1/2})$, while
$\log \lambda(n,k) = \Theta(\log n)$.

Constructing the minimum function for a pair of functions, as
described above, is part of a recursive algorithm for describing the
function h(t) of Equation (1). An efficient description of h(t) is
obtained by means of the algorithm associated with Theorem 3.2.

Since the number of PEs in a mesh must be a power of 4, define

$\lambda_m(n, s) = 4^{\lceil \log_4 \lambda(n,s) \rceil}$.

Since the number of nodes in a hypercube must be a power of 2, define

$\lambda_h(n, s) = 2^{\lceil \log_2 \lambda(n,s) \rceil}$.

Note $\lambda_m(n, s) \ge \lambda(n, s)$, $\lambda_h(n, s) \ge \lambda(n, s)$,
$\lambda_m(n, s) = \Theta(\lambda(n, s))$, and
$\lambda_h(n, s) = \Theta(\lambda(n, s))$.


Theorem 3.2 Let $f_0, \ldots, f_{n-1}$ be continuous real-valued
functions defined on $[0,\infty)$, no distinct pair of which intersects
more than s times. Assume a) each $f_i$ has a $\Theta(1)$ storage
description, b) each $f_i(t)$ may be calculated in $\Theta(1)$ time for a
given t by a single processor, and c) for every distinct pair $f_i$ and
$f_j$, the (at most s) real solutions to $f_i(t) = f_j(t)$ can be
computed in $\Theta(1)$ time by a single processor. Suppose descriptions
of $f_0, \ldots, f_{n-1}$ are stored one per PE in a mesh with
$\lambda_m(n, s)$ PEs or in a hypercube with $\lambda_h(n, s)$ PEs.
Then the minimum function h(t) can be constructed by the mesh in
$O(\lambda^{1/2}(n, s))$ time; by the hypercube in $\Theta(\log^3 n)$
time; and by the hypercube in expected $\Theta(\log^2 n)$ time. At the
end of the algorithm, the description of h(t) is given with the pieces
ordered by their intervals, one piece per PE.


Proof: A general algorithm is given in 3 steps. 


1. Split the descriptions of $\{f_0, f_1, \ldots, f_{n-1}\}$ evenly
   among the processors.

2. Recursively, and in parallel, have the string with
   $f_0, \ldots, f_{\lfloor (n-1)/2 \rfloor}$ construct the ordered
   pieces $p_1, \ldots, p_u$ for

   $h_1(t) = \min\{f_0(t), \ldots, f_{\lfloor (n-1)/2 \rfloor}(t)\}$

   generated by $\{f_0, \ldots, f_{\lfloor (n-1)/2 \rfloor}\}$, while the
   string with the functions
   $f_{\lfloor (n-1)/2 \rfloor + 1}, \ldots, f_{n-1}$ constructs the
   ordered pieces $q_1, \ldots, q_v$ representing

   $h_2(t) = \min\{f_{\lfloor (n-1)/2 \rfloor + 1}(t), \ldots, f_{n-1}(t)\}$

   generated by
   $\{f_{\lfloor (n-1)/2 \rfloor + 1}(t), \ldots, f_{n-1}(t)\}$. Since
   $u, v \le \lambda(n/2, s)$, then from Lemma 2.2, each of the PEs is
   responsible for at most one piece of a minimum function.

   At the end of this step, descriptions of the pieces
   $\{p_1, \ldots, p_u\}$ and $\{q_1, \ldots, q_v\}$ are ordered by their
   intervals in disjoint strings, each consisting of half of the PEs.

3. Describe $h(t) = \min\{h_1(t), h_2(t)\}$ by the algorithm of
   Lemma 3.1.


Let T(n) be the running time of the algorithm. For the mesh we
have the following analysis. Step 1 requires $O(\lambda^{1/2}(n, s))$
time. Step 2 is a recursive call. Since we have
$u + v \le 2\lambda(n/2, s) \le \lambda(n, s)$, the latter inequality by
Lemma 2.2, it follows from Lemma 3.1 that Step 3 requires
$O(\lambda^{1/2}(n, s))$ time. Therefore, the running time of the
algorithm satisfies the recurrence
$T(n) = T(n/2) + O(\lambda^{1/2}(n, s))$. It follows from Lemma 2.2 that
$T(n) = O(\lambda^{1/2}(n, s))$.

For the hypercube we have the following analysis. Step 1 requires
$\Theta(\log^2 n)$ time, expected $\Theta(\log n)$ time. Step 2 is a
recursive call. Step 3 requires $\Theta(\log^2 n)$ time, expected
$\Theta(\log n)$ time, by Lemma 3.1. Therefore, the running time of the
algorithm satisfies the recurrence $T(n) = T(n/2) + \Theta(\log^2 n)$,
which is $\Theta(\log^3 n)$. The expected running time satisfies
$T(n) = T(n/2) + \Theta(\log n)$, which is $\Theta(\log^2 n)$. ∎
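The solutions of the two hypercube recurrences can be seen by unrolling them. The following worked derivation is a sketch that assumes n is a power of 2, that logarithms are base 2, and that c is the constant hidden in the $\Theta$ term.

```latex
\begin{align*}
T(n) &= T(n/2) + c\log^2 n
      = \sum_{i=0}^{\log n - 1} c \log^2\!\left(\frac{n}{2^i}\right) + T(1)
      = c \sum_{j=1}^{\log n} j^2 + T(1)
      = \Theta(\log^3 n), \\
T_{\mathrm{exp}}(n) &= T_{\mathrm{exp}}(n/2) + c\log n
      = c \sum_{j=1}^{\log n} j + T_{\mathrm{exp}}(1)
      = \Theta(\log^2 n).
\end{align*}
```

Each level of the recursion halves n, so the j-th term of the sum corresponds to a subproblem of size $2^j$.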


The function f(t) has a jump discontinuity at u if both
$\lim_{t \to u^+} f(t)$ and $\lim_{t \to u^-} f(t)$ exist, and
$\lim_{t \to u^+} f(t) \ne \lim_{t \to u^-} f(t)$. The real-valued
function f(t) whose domain is a subset of $[0,\infty)$ has a transition
at $t_0$ [Atal85] if and only if

• $t_0 > 0$, and
• there exists $\delta > 0$ such that either

  1. for all t such that $0 < t < \delta$, $f(t_0 - t)$ is defined and
     $f(t_0 + t)$ is undefined, or

  2. for all t such that $0 < t < \delta$, $f(t_0 - t)$ is undefined and
     $f(t_0 + t)$ is defined.


Lemma 3.3 [Boxe87a] Let k be a positive integer. Let
$f_0, \ldots, f_{n-1}$ be real-valued functions of time such that (a)
every $f_i$ is continuous except for at most $p_i$ jump discontinuities,
(b) every $f_i$ has at most $q_i$ transitions, where (c)
$p_i + q_i \le k$, and (d) no pair of distinct functions $f_i$ and $f_j$
intersect more than s times. Then
$h(t) = \min\{f_0(t), \ldots, f_{n-1}(t)\}$ has no more than
$\lambda(n, s + 2k)$ pieces generated by $\{f_0, \ldots, f_{n-1}\}$. ∎


Theorem 3.4 Let k be a positive integer and let
$f_0, \ldots, f_{n-1}$ be as in Lemma 3.3. Assume also that the $f_i$
satisfy a), b), and c) of Theorem 3.2. A description of the function
$h(t) = \min\{f_0(t), \ldots, f_{n-1}(t)\}$ can be given in
$O(\lambda^{1/2}(n, s + 2k))$ time by a mesh of size
$\lambda_m(n, s + 2k)$ so that the pieces are ordered, one per PE. A
description of the function h(t) can be given in $\Theta(\log^3 n)$
time, and in expected $\Theta(\log^2 n)$ time, by a hypercube of size
$\lambda_h(n, s + 2k)$ so that the pieces are ordered, one per PE.


Proof: The assertion may be proved by an argument that is
virtually identical to that given for Theorem 3.2. ∎


4 Transient Behavior Computations 


We apply the results of the previous section to dynamic systems of
point-objects, showing how to determine geometric properties of the
system.


4.1 Closest Points, Farthest Points, and Collision 


Let S be a sequence of points closest, say, to $P_0$, listed in
chronological order. That is, the first member of S is a closest point
to $P_0$ at time t = 0, and the last member of S is a closest point to
$P_0$ as t approaches infinity. Let S' be a sequence of “farthest”
points from $P_0$, listed in chronological order.


Theorem 4.1 For a system of n points in d-dimensional space with
k-motion, each of S and S' can be constructed on a mesh of size
$\lambda_m(n-1, 2k)$ in $O(\lambda^{1/2}(n-1, 2k))$ time; on a
hypercube of size $\lambda_h(n-1, 2k)$ in $\Theta(\log^3 n)$ time; and
on a hypercube of size $\lambda_h(n-1, 2k)$ in expected
$\Theta(\log^2 n)$ time.


Proof: Let $d_{0j}(t)$ be the Euclidean distance between points $P_0$
and $P_j$ at time t. Then each $d_{0j}^2(t)$ is a polynomial of degree
$\le 2k$.

We give an algorithm in 3 steps for constructing S. A similar
algorithm may be used to construct S'.

1. Broadcast a description of function $f_0$ so that, without loss of
   generality, $PE_j$ has descriptions of the distinct pairs
   $(f_0, f_j)$, $0 < j < n$.

2. In parallel, each processor $PE_j$ constructs the function
   $d_{0j}^2(t)$ from $f_0$ and $f_j$.

3. Construct the min function h(t) of the family of functions
   $d_{0j}^2(t)$ by the algorithm of Theorem 3.2. For each piece of
   h(t), a pair of points that yielded the piece corresponds to an
   element of S.

For the mesh we have the following analysis. Step 1 is accomplished
in $O(\lambda^{1/2}(n-1, 2k))$ time. Step 2 is accomplished in
$\Theta(1)$ time. Step 3 requires $O(\lambda^{1/2}(n-1, 2k))$ time by
Theorem 3.2. Thus, the running times for the mesh are as claimed.

For the hypercube we have the following analysis. Step 1 is
accomplished in $\Theta(\log n)$ time. Step 2 is accomplished in
$\Theta(1)$ time. Step 3 requires $\Theta(\log^3 n)$ time, and expected
$\Theta(\log^2 n)$ time, by Theorem 3.2. Thus, the running times for the
hypercube are as claimed.
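Step 2 of this proof, building the squared-distance polynomial from two motion descriptions, can be sketched as follows. The representation is an assumption made for illustration: each point is a list of d coordinate polynomials, and each polynomial is a coefficient list with the constant term first. The helper names are hypothetical.

```python
def poly_add(p, q):
    """Sum of two polynomials given as coefficient lists (lowest degree first)."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def poly_sub(p, q):
    """Difference p - q of two coefficient lists."""
    return poly_add(p, [-c for c in q])


def poly_mul(p, q):
    """Product of two coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def squared_distance(f0, fj):
    """Coefficients of the squared distance between two moving points.

    With k-motion every coordinate has degree <= k, so the result has
    degree <= 2k.
    """
    total = [0]
    for c0, cj in zip(f0, fj):
        diff = poly_sub(cj, c0)
        total = poly_add(total, poly_mul(diff, diff))
    return total
```

For a stationary point at the origin and a point at (1 + t, 2), the squared distance is $5 + 2t + t^2$, a polynomial of degree 2 = 2k for k = 1.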


Sometimes it is more important to determine whether or not two
points collide rather than which pair is closest. Define points $P_i$
and $P_j$ to collide at time t if and only if $f_i(t) = f_j(t)$.


Theorem 4.2 Assume that a system of n points in d-dimensional
space with k-motion is given. Then a chronological list of times at
which $P_0$ collides with any other point of the system can be created
in $\Theta(n^{1/2})$ time on a mesh of size $4^{\lceil \log_4 n \rceil}$,
in $\Theta(\log^2 n)$ time on a hypercube of size
$2^{\lceil \log_2 n \rceil}$, and in expected $\Theta(\log n)$ time on
such a hypercube.


Proof: We observe that $P_0$ and $P_j$ ($j > 0$) collide if and only if
$d_{0j}^2(t) = 0$ has a solution for $t > 0$. Without loss of generality,
$PE_i$ stores a description of $f_i$, $0 \le i < n$.
The algorithm is given in 4 steps.

1. Broadcast to each PE a description of $f_0$.

2. In parallel, $PE_i$, $0 < i < n$, determines $d_{0i}^2(t)$.

3. In parallel, $PE_i$, $0 < i < n$, solves $d_{0i}^2(t) = 0$ for its at
   most 2k roots (since $d_{0i}^2(t)$ is a polynomial of degree at most
   2k) that are greater than 0.

4. Sort the roots to obtain the desired list.

Each solution in Step 3 represents a collision of $P_0$ with another
point-object of the system.

For the mesh, we have the following analysis. Steps 1 and 4 each
require $\Theta(n^{1/2})$ time. Steps 2 and 3 each require $\Theta(1)$
time. Thus, our algorithm has $\Theta(n^{1/2})$ running time.

For the hypercube, we have the following analysis. Step 1 requires
$\Theta(\log n)$ time. Steps 2 and 3 each require $\Theta(1)$ time.
Step 4 requires $\Theta(\log^2 n)$ time, expected $\Theta(\log n)$ time.
Thus, the algorithm has $\Theta(\log^2 n)$ running time, and expected
$\Theta(\log n)$ running time. ∎
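A sequential sketch of the collision list for the special case of 1-motion (each coordinate is $a + bt$) is given below. Instead of solving $d_{0i}^2(t) = 0$ numerically, it uses the equivalent condition that every coordinate difference vanishes at the same time, computed exactly with rational arithmetic; that substitution, the integer-coefficient input, and the function names are assumptions of this sketch.

```python
from fractions import Fraction


def collision_time(p0, pi):
    """p0, pi: lists of (a, b) per coordinate, meaning position a + b*t
    with integer a, b. Returns the collision time t > 0, or None."""
    t = None
    for (a0, b0), (a1, b1) in zip(p0, pi):
        da, db = a1 - a0, b1 - b0
        if db == 0:
            if da != 0:
                return None       # this coordinate never coincides
            continue              # coincides for all t: no constraint
        root = Fraction(-da, db)  # the only time this coordinate coincides
        if t is not None and root != t:
            return None           # coordinates coincide at different times
        t = root
    return t if t is not None and t > 0 else None


def collision_list(points):
    """Chronological list of (time, index) at which points[0] collides
    with points[index]; corresponds to Steps 2 through 4 done sequentially."""
    hits = []
    for i in range(1, len(points)):
        t = collision_time(points[0], points[i])
        if t is not None:
            hits.append((t, i))
    return sorted(hits)
```

In the usage below, $P_0$ moves along the x-axis as (t, 0); it meets the stationary point (1, 0) at time 1 and the point (4 - t, 0) at time 2.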


4.2 Convex Hull 


The convex hull of a set of points $S = \{P_0, \ldots, P_{n-1}\}$,
denoted hull(S), is the smallest convex set containing S. A point
$P_i \in S$ is an extreme point or vertex of hull(S) if
$P_i \notin \mathrm{hull}(S - \{P_i\})$. In this section, we
develop a general parallel algorithm to generate a description of the
intervals of time over which a given point $P_0 \in S$ is an extreme
point of hull(S). We also give efficient implementations of the
algorithm for the mesh and hypercube.

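Assume k-motion in the plane. Let $T_{ij}(t)$ be the angle made by
rotating the positively oriented horizontal ray with endpoint $P_i$ about
$P_i$ until the ray contains the line segment from $P_i$ to $P_j$ at time
t. By convention, $-\pi < T_{ij}(t) \le \pi$. Formally, if $x_i(t)$,
$x_j(t)$, $y_i(t)$, and $y_j(t)$ are the x and y coordinates of the
points $P_i$ and $P_j$, respectively, at time t, then

$$T_{ij}(t) = \begin{cases}
\pi/2 & \text{if } x_i(t) = x_j(t) \text{ and } y_i(t) < y_j(t) \\
-\pi/2 & \text{if } x_i(t) = x_j(t) \text{ and } y_i(t) > y_j(t) \\
\arctan\frac{y_j(t) - y_i(t)}{x_j(t) - x_i(t)} & \text{if } x_i(t) < x_j(t) \\
\arctan\frac{y_j(t) - y_i(t)}{x_j(t) - x_i(t)} + \pi & \text{if } x_i(t) > x_j(t) \text{ and } y_i(t) \le y_j(t) \\
\arctan\frac{y_j(t) - y_i(t)}{x_j(t) - x_i(t)} - \pi & \text{if } x_i(t) > x_j(t) \text{ and } y_i(t) > y_j(t) \\
\text{undefined} & \text{if } x_i(t) = x_j(t) \text{ and } y_i(t) = y_j(t).
\end{cases}$$

Define
$$G_{ij}(t) = \begin{cases} T_{ij}(t) & \text{if } T_{ij}(t) \ge 0 \\ \text{undefined} & \text{otherwise.} \end{cases}$$

Define
$$B_{ij}(t) = \begin{cases} T_{ij}(t) & \text{if } T_{ij}(t) < 0 \\ \text{undefined} & \text{otherwise.} \end{cases}$$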
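At a fixed instant, the case analysis for the angle agrees with the two-argument arctangent of the Python standard library. The sketch below compares the two; `angle_cases` and `angle_atan2` are illustrative names, and coordinates are passed as plain numbers for a single time t.

```python
import math


def angle_cases(xi, yi, xj, yj):
    """Angle of the segment from P_i to P_j, following the case analysis."""
    if xi == xj:
        if yi < yj:
            return math.pi / 2
        if yi > yj:
            return -math.pi / 2
        return None                      # coincident points: undefined
    base = math.atan((yj - yi) / (xj - xi))
    if xi < xj:
        return base
    return base + math.pi if yi <= yj else base - math.pi


def angle_atan2(xi, yi, xj, yj):
    """Same angle via math.atan2, which also returns values in (-pi, pi]."""
    if xi == xj and yi == yj:
        return None
    return math.atan2(yj - yi, xj - xi)
```

The two functions can be used interchangeably when evaluating the angle at sample times.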


Define the functions $a_i$, $b_i$, $c_i$, and $d_i$ as follows.

$a_i(t) = \min\{G_{ij}(t) \mid 0 \le j < n, i \ne j, G_{ij}(t) \text{ is defined}\}$.
$b_i(t) = \max\{G_{ij}(t) \mid 0 \le j < n, i \ne j, G_{ij}(t) \text{ is defined}\}$.
$c_i(t) = \min\{B_{ij}(t) \mid 0 \le j < n, i \ne j, B_{ij}(t) \text{ is defined}\}$.
$d_i(t) = \max\{B_{ij}(t) \mid 0 \le j < n, i \ne j, B_{ij}(t) \text{ is defined}\}$.


If at time t, $G_{ij}(t)$ is undefined (respectively, $B_{ij}(t)$ is
undefined) for all j, then $a_i(t)$ and $b_i(t)$ (respectively, $c_i(t)$
and $d_i(t)$) are undefined.


Define $T = \{T_{0j} \mid 0 < j < n\}$.


Lemma 4.3 [Atal85], [Boxe87b] For a system of n points with
k-motion, each of the functions $a_0$, $b_0$, $c_0$, and $d_0$ has at
most $\lambda(n, 4k)$ pieces generated by T. ∎


Lemma 4.4 [Atal85] Given a set S of n points moving in the plane,
a point $P_0$ is an extreme point of hull(S) at time t if and only if

1. $a_0(t) - d_0(t) > \pi$, or
2. $b_0(t) - c_0(t) < \pi$, or
3. $a_0(t)$ and $b_0(t)$ are undefined, or
4. $c_0(t)$ and $d_0(t)$ are undefined. ∎
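The test of Lemma 4.4 can be evaluated at a single instant with a few lines of sequential code. The sketch below takes the positions of the points at one time t as `(x, y)` pairs, with `points[0]` playing the role of $P_0$; the function name is hypothetical, and the strict inequalities follow the statement as reconstructed here.

```python
import math


def is_extreme_p0(points):
    """Apply the four conditions of Lemma 4.4 to points[0] at one instant."""
    x0, y0 = points[0]
    angles = [math.atan2(y - y0, x - x0)
              for x, y in points[1:] if (x, y) != (x0, y0)]
    G = [a for a in angles if a >= 0]    # defined values of G_0j
    B = [a for a in angles if a < 0]     # defined values of B_0j
    if not G or not B:
        return True                      # conditions 3 and 4
    a0, b0 = min(G), max(G)
    c0, d0 = min(B), max(B)
    # Condition 1: empty angular gap through direction 0 exceeds pi.
    # Condition 2: empty angular gap through direction pi exceeds pi.
    return a0 - d0 > math.pi or b0 - c0 < math.pi
```

A point strictly inside the triangle formed by the other points fails all four conditions, while a leftmost or rightmost point satisfies one of the first two.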


Theorem 4.5 Let $S = \{P_0, \ldots, P_{n-1}\}$ be a set of points in
the plane with k-motion. Then the ordered intervals of time during
which a given point $P_0$ is an extreme point of hull(S) can be
determined in $O(\lambda^{1/2}(n, 4k))$ time on a mesh of size
$\lambda_m(n, 4k)$; in $\Theta(\log^3 n)$ time on a hypercube of size
$\lambda_h(n, 4k)$; and in expected $\Theta(\log^2 n)$ time on a
hypercube of size $\lambda_h(n, 4k)$.


Proof: For each j, $0 \le j < n$, let $x_j(t)$ be the x-coordinate of
$P_j$ at time t and let $y_j(t)$ be the y-coordinate of $P_j$ at time t.
Observe that solving $T_{0j}(t) = T_{0m}(t)$ means finding instants at
which the directed line segment from $P_0$ to $P_j$ and the directed line
segment from $P_0$ to $P_m$ have equal slopes and are similarly oriented.
Instants when the line segments have equal slopes can be determined by
solving

$[x_m(t) - x_0(t)][y_j(t) - y_0(t)] = [x_j(t) - x_0(t)][y_m(t) - y_0(t)]$,

a polynomial equation of degree at most 2k, which we assume can be
solved in $\Theta(1)$ time by a single PE. Further, determining whether
or not two directed line segments with equal slopes are similarly
oriented can be accomplished in $\Theta(1)$ serial time. It follows that
$T_{0j}(t) = T_{0m}(t)$ can be solved by a single processor in
$\Theta(1)$ time.

Define

$$A_0(t) = \begin{cases} 1 & \text{if } a_0(t) - d_0(t) > \pi \\ 0 & \text{otherwise,} \end{cases}$$

$$B_0(t) = \begin{cases} 1 & \text{if } b_0(t) - c_0(t) < \pi \\ 0 & \text{otherwise,} \end{cases}$$

$$C_0(t) = \begin{cases} 1 & \text{if both } a_0(t) \text{ and } b_0(t) \text{ are undefined} \\ 0 & \text{otherwise,} \end{cases}$$

and

$$D_0(t) = \begin{cases} 1 & \text{if both } c_0(t) \text{ and } d_0(t) \text{ are undefined} \\ 0 & \text{otherwise.} \end{cases}$$

Our general algorithm is given below.

1. It is shown in the proof of Lemma 4.3 [Boxe87b] that each
   $G_{0j}$ (similarly, each $B_{0j}$) has at most k values of t that
   yield jump discontinuities or transitions. Construct the functions
   $a_0(t)$, $b_0(t)$, $c_0(t)$, and $d_0(t)$.

2. From Lemma 4.3 and Lemma 2.3, each of $a_0(t) - d_0(t)$ and
   $b_0(t) - c_0(t)$ has no more than
   $2\lambda(n, 4k) = \Theta(\lambda(n, 4k))$ pieces generated by
   differences of members of T. Construct the ordered pieces of the
   functions $a_0(t) - d_0(t)$ and $b_0(t) - c_0(t)$. Similarly,
   construct the $O(\lambda(n, 4k))$ ordered maximal intervals on which
   $a_0(t)$ and $b_0(t)$ are both undefined (respectively, on which
   $c_0(t)$ and $d_0(t)$ are both undefined).

3. If $I_1$ and $I_2$ are intervals of pieces of $a_0$ and $d_0$,
   respectively, where $I = I_1 \cap I_2$ is nondegenerate, then
   $(a_0 - d_0)|_I(t) = \pi$ implies there are integers j and m
   determined by $I_1$ and $I_2$, respectively, such that
   $a_0|_I = T_{0j}$, $d_0|_I = T_{0m}$, and
   $T_{0j}(t) - T_{0m}(t) = \pi$. There are at most 2k such instants,
   and they may be determined in $\Theta(1)$ time. It follows from
   Lemma 2.4 that every piece of $a_0(t) - d_0(t)$ generated by
   differences of members of T yields at most 2k + 1 pieces of $A_0(t)$
   generated by the set of constant functions {0, 1}. Therefore,
   $A_0(t)$ has at most
   $(2k+1)\,2\lambda(n, 4k) = \Theta(\lambda(n, 4k))$ pieces generated by
   {0, 1}. Similarly, $B_0(t)$ has $O(\lambda(n, 4k))$ pieces generated
   by {0, 1}. Construct descriptions of the functions $A_0(t)$ and
   $B_0(t)$ by using the algorithm of Lemma 3.1. Similarly, construct
   the $O(\lambda(n, 4k))$ ordered pieces generated by {0, 1} of each of
   the functions $C_0(t)$ and $D_0(t)$.

4. It follows from Lemma 2.4 that there are $O(\lambda(n, 4k))$ pieces
   generated by {0, 1} of

   $H_0(t) = \max\{A_0(t), B_0(t), C_0(t), D_0(t)\}$.

   Describe $H_0(t)$ via a fixed number of applications of the
   algorithm of Lemma 3.1. Note that Lemma 4.4 implies $P_0$ is an
   extreme point at time t if and only if $H_0(t) = 1$.

5. Pack the intervals for which $H_0(t) = 1$ into a string by sorting
   in order to obtain the desired sequence of intervals.


For the mesh: Step 1 requires $O(\lambda^{1/2}(n, 4k))$ time, by
Theorem 3.4. Steps 2, 3, and 4 each require $O(\lambda^{1/2}(n, 4k))$
time, by Lemma 3.1. Step 5 requires $O(\lambda^{1/2}(n, 4k))$ time.
Hence the running time of the algorithm is $O(\lambda^{1/2}(n, 4k))$.

For the hypercube: Step 1 needs $\Theta(\log^3 n)$ time, expected
$\Theta(\log^2 n)$ time, by Theorem 3.4. Steps 2, 3, and 4 each need
$\Theta(\log^2 n)$ time, by Lemma 3.1. Step 5 needs $\Theta(\log^2 n)$
time. Hence the running time of the algorithm is $\Theta(\log^3 n)$,
expected $\Theta(\log^2 n)$. ∎


4.3 Containment Problems 


In this section, we address a variety of problems concerning shapes 
and sizes of containers into which a dynamic system of points will fit. 
We assume k-motion in d-dimensional space, for fixed k and d. 

Let J be the ordered list of intervals of time during which the
points $P_0, \ldots, P_{n-1}$ can be enclosed within a rectilinear,
iso-oriented hyperrectangle (a d-dimensional analog of a box with sides
parallel or perpendicular to each of the coordinate axes) of given fixed
dimensions.


Theorem 4.6 For a set of n points with k-motion in d-dimensional
space, the ordered sequence J can be constructed on a mesh of size
$\lambda_m(n, k)$ in $O(\lambda^{1/2}(n, k))$ time; on a hypercube of
size $\lambda_h(n, k)$ in $\Theta(\log^3 n)$ time; on a hypercube of
size $\lambda_h(n, k)$ in expected $\Theta(\log^2 n)$ time.


Proof: For $1 \le i \le d$, let $p_i : R^d \to R$ be the $i$th
coordinate function. That is, for a point $X = (x_1, \ldots, x_d)$,
$p_i(X) = x_i$. Note that for each i and j, a description of
$p_i(f_j(t))$ is stored in the PE containing a description of $f_j$.
Define

$m_i(t) = \min\{p_i(f_0(t)), \ldots, p_i(f_{n-1}(t))\}$,

and

$M_i(t) = \max\{p_i(f_0(t)), \ldots, p_i(f_{n-1}(t))\}$.

Our algorithm is given below.

1. Describe the functions $m_1(t), \ldots, m_d(t)$,
   $M_1(t), \ldots, M_d(t)$ using the algorithm of Theorem 3.2, such
   that each of $m_i$ and $M_i$, $1 \le i \le d$, has at most
   $\lambda(n, k)$ pieces generated by
   $F_i = \{p_i(f_0(t)), \ldots, p_i(f_{n-1}(t))\}$.

2. Construct descriptions of all the functions
   $D_i(t) = M_i(t) - m_i(t)$, $1 \le i \le d$ ($D_i(t)$ is the maximum
   separation in the $i$th coordinate among the points
   $\{P_0, \ldots, P_{n-1}\}$ at time t), by the algorithm of Lemma 3.1.
   It follows from Lemma 2.3 that each $D_i(t)$, $1 \le i \le d$, has at
   most $2\lambda(n, k)$ pieces generated by differences of pairs of
   members of $F_i$.

3. Let $X_i$ be the length of the hyperrectangle in the $i$th
   coordinate, $1 \le i \le d$. Then for each i, $1 \le i \le d$, the
   function

   $$W_i(t) = \begin{cases} 1 & \text{if } D_i(t) \le X_i \\ 0 & \text{otherwise} \end{cases}$$

   has at most $2(k + 1)\lambda(n, k)$ pieces generated by the set of
   constant functions {0, 1}, by Lemma 2.4. Describe
   $W_1(t), \ldots, W_d(t)$ by the algorithm of Lemma 3.1.

4. Let $C(t) = \min\{W_1(t), \ldots, W_d(t)\}$. Notice that C(t) = 1 if
   and only if $\{P_0, \ldots, P_{n-1}\}$ will fit inside a
   hyperrectangle of the given fixed dimensions at time t. Describe C(t)
   from descriptions of the set of functions
   $\{W_1(t), \ldots, W_d(t)\}$. This is accomplished by performing
   $O(\log d) = \Theta(1)$ stages of the algorithm of Lemma 3.1, where
   at each stage, pairs of functions are combined.

5. The (ordered) intervals of the pieces of C(t) for which C(t) = 1
   form the desired sequence J. Pack these intervals into a string
   via a sorting operation.

For the mesh: Step 1 requires $O(\lambda^{1/2}(n, k))$ time by
Theorem 3.2. Steps 2, 3, and 4 require $O(\lambda^{1/2}(n, k))$ time, by
Lemma 3.1. Step 5 requires $O(\lambda^{1/2}(n, k))$ time. Thus the
algorithm requires $O(\lambda^{1/2}(n, k))$ time.

For the hypercube: Step 1 uses $\Theta(\log^3 n)$ time, expected
$\Theta(\log^2 n)$ time. Steps 2, 3, and 4 use $\Theta(\log^2 n)$ time,
by Lemma 3.1. Step 5 uses $\Theta(\log^2 n)$ time. Thus the algorithm
requires $\Theta(\log^3 n)$ time, expected $\Theta(\log^2 n)$ time.
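The meaning of the functions $D_i$, $W_i$, and C at one instant can be shown with a small sequential sketch. It assumes each point is given as a list of d coordinate polynomials (coefficient lists, constant term first) and evaluates the containment test directly at a chosen time rather than building piecewise descriptions; the names `poly_eval` and `fits` are hypothetical.

```python
def poly_eval(coeffs, t):
    """Evaluate a polynomial (lowest-degree coefficient first) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def fits(points, box, t):
    """Value of C(t): 1 if all points fit in an iso-oriented box with side
    lengths box[i] at time t, else 0."""
    for i, side in enumerate(box):
        vals = [poly_eval(p[i], t) for p in points]
        spread = max(vals) - min(vals)     # D_i(t) = M_i(t) - m_i(t)
        if spread > side:                  # W_i(t) = 0
            return 0
    return 1                               # every W_i(t) = 1
```

With one stationary point at the origin and one point at (t, 1), a 2-by-1 box contains both points at t = 1 but not at t = 3.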

Define

$\Psi = \{p_i(f_j(t)) \mid 1 \le i \le d, 0 \le j < n\}$.


Theorem 4.7 Assume a system of points $S = \{P_0, \ldots, P_{n-1}\}$
has k-motion in d-dimensional space. The function D(t), whose value at
time t is the edgelength of the smallest iso-oriented rectilinear
hypercube that will contain S, has $\Theta(\lambda(n, k))$ pieces
generated by differences of members of $\Psi$. A description of D(t) can
be constructed in an ordered fashion on a mesh of size
$\lambda_m(n, k)$ in $O(\lambda^{1/2}(n, k))$ time; on a hypercube of
size $\lambda_h(n, k)$ in $\Theta(\log^3 n)$ time; and on a hypercube of
size $\lambda_h(n, k)$ in expected $\Theta(\log^2 n)$ time.

Proof: We give a general algorithm of 2 steps.

1. Let $D_1(t), \ldots, D_d(t)$ be as in Theorem 4.6. Construct
   descriptions of $D_1(t), \ldots, D_d(t)$ via the algorithm of
   Theorem 4.6. It was shown in the proof of Theorem 4.6 that each of
   these functions has at most $2\lambda(n, k)$ pieces generated by
   differences of members of $\Psi$.

2. Since $D(t) = \max\{D_1(t), \ldots, D_d(t)\}$, observe D(t) can be
   described from $D_1(t), \ldots, D_d(t)$ by performing
   $O(\log d) = \Theta(1)$ stages of the algorithm of Lemma 3.1, where
   at each stage, ordered pieces of pairs of functions are combined. If
   each function being combined has no more than $c\lambda(n, k)$
   pieces, c a constant, then the maximum of the two functions has no
   more than $2c(k + 1)\lambda(n, k)$ pieces, by Lemma 2.3 and
   Lemma 2.4.

Since each of the $\Theta(1)$ combine steps increases the number of
pieces by no more than a constant factor, the number of pieces of D(t)
is $\Theta(\lambda(n, k))$.

Our claims concerning the running times follow from Theorem 4.6
and Lemma 3.1. ∎


We now show how to determine a smallest iso-oriented rectilinear 


hypercube that can ever contain the set of points S = {Po,..., Pn—-1}- 


Theorem 4.8 Let S be a system of n points with k-motion in Eu- 
clidean d-dimensional space. Let Dmin = min{D(t)|t > 0}, where 
D(t) is as in Theorem 4.7. Then Din and a time tmin at which 
D(tmin) = Dmin can be computed in O(a 2(n,k)) time on a mesh of 
size Am(n,k); in O(log® n) time on a hypercube of size A»(n,k); and 
in expected O(log” n) time on a hypercube of size ;,(n,k). 


Proof: We give a general algorithm of 4 steps.


1. Construct a description of D(t) by Theorem 4.7. The function
D(t) has $O(\lambda(n,k))$ pieces generated by differences of members
of Y, so that responsibility for the pieces is divided evenly among
the PEs in that each PE is responsible for O(1) pieces of D(t).


2. In parallel, each PE does the following. For each of the pieces
of D(t) for which the PE is responsible, compute the minimum
value of D(t) on the interval of the piece.


3. In parallel, each PE determines the minimum of its pieces'
minima, and a time when the minimum occurs.


4. Compute the minimum of all the PEs' minima, $D_{\min}$, with a
minimizing time being recorded.


For the mesh: Step 1 requires $O(\lambda^{1/2}(n,k))$ time, by Theorem 4.7.
Step 2 may be done, using well-known principles of calculus, in O(1)
time. Step 3 requires $\Theta(1)$ time. Step 4 requires $O(\lambda^{1/2}(n,k))$ time.
Hence the algorithm requires $O(\lambda^{1/2}(n,k))$ time.

For the hypercube: Step 1 uses $O(\log^3 n)$ time, expected $O(\log^2 n)$
time, by Theorem 4.7. As with the mesh, Step 2 requires O(1) time.
Step 3 requires O(1) time. Step 4 requires O(log n) time. Hence the
algorithm requires $O(\log^3 n)$ and expected $O(\log^2 n)$ time. □
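Steps 2 to 4 of the proof can be illustrated sequentially. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: each piece of D(t) is assumed to be a polynomial of degree at most 2 on a finite interval, stored as the hypothetical tuple (t_lo, t_hi, a, b, c), and the parallel reduction of Step 4 is replaced by a plain `min`.

```python
def piece_minimum(t_lo, t_hi, a, b, c):
    """Minimum of a*t**2 + b*t + c on the finite interval [t_lo, t_hi].

    Returns (value, time). Candidates are the two endpoints and, when it
    lies inside the interval, the stationary point of the quadratic.
    """
    candidates = [t_lo, t_hi]
    if a != 0:
        vertex = -b / (2 * a)
        if t_lo < vertex < t_hi:
            candidates.append(vertex)
    return min((a * t * t + b * t + c, t) for t in candidates)


def global_minimum(pieces):
    """Steps 3 and 4 collapsed: the smallest per-piece minimum and its time."""
    return min(piece_minimum(*piece) for piece in pieces)


# Hypothetical example data: D(t) = 3 - t on [0, 1], then t**2 - 4t + 5 on [1, 4].
example_pieces = [(0, 1, 0, -1, 3), (1, 4, 1, -4, 5)]
d_min, t_min = global_minimum(example_pieces)
```

In the parallel setting each PE would call `piece_minimum` on its O(1) pieces, and only the final reduction would need communication.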


5 Steady-State Computations 


We use the term “steady-state” to refer to conditions as t (time)
approaches infinity. In this section, we give parallel algorithms for
determining steady-state properties of dynamic systems, mostly in
the plane. Due to space limitations, we omit the proofs.

The algorithms in this section are adapted from [Boxe87a, Mill86, 
Mill88a, Mill88b, Mill88c, Sanz87, Sham75]. 


Theorem 5.1 Let k and d be fixed integers such that k > 0 and d > 0.
Given a set of points $\{P_0, \ldots, P_{n-1}\}$ with k-motion in d-dimensional
space, a steady-state nearest (farthest) neighbor to a given point $P_0$
can be determined on a mesh of size $4^{\lceil \log_4 n \rceil}$ in $\Theta(n^{1/2})$ time and on
a hypercube of size $2^{\lceil \log_2 n \rceil}$ in O(log n) time.


Theorem 5.2 For a system of n points with k-motion in the plane, a
steady-state closest pair can be identified by a mesh of size $4^{\lceil \log_4 n \rceil}$ in
$O(n^{1/2})$ time; by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in $O(\log^2 n)$ time; and
by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in expected O(log n) time. □


Theorem 5.3 For a system S of n points with k-motion in the plane,
the steady-state hull(S) can be constructed by a mesh of size $4^{\lceil \log_4 n \rceil}$
in $O(n^{1/2})$ time; by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in $O(\log^2 n)$ time; and
by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in expected O(log n) time. □


Theorem 5.4 Let an integer k > 0 be fixed. Let $S = \{P_0, \ldots, P_{n-1}\}$
be a set of point-objects moving in the plane with k-motion such that
when steady-state is reached, S is the set of distinct extreme points of
a convex polygon C. The diameter function of C may be determined
in $\Theta(n^{1/2})$ time by a mesh of size $4^{\lceil \log_4 n \rceil}$; in $O(\log^2 n)$ time by a
hypercube of size $2^{\lceil \log_2 n \rceil}$; and in expected O(log n) time by a hypercube
of size $2^{\lceil \log_2 n \rceil}$. □


Corollary 5.5 For a set S of n points with k-motion in the plane, a
steady-state farthest pair can be determined by a mesh of size $4^{\lceil \log_4 n \rceil}$
in $O(n^{1/2})$ time, by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in $O(\log^2 n)$ time, and
by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in expected O(log n) time. □


Theorem 5.6 Let $S = \{P_0, P_1, \ldots, P_{n-1}\}$ be a system of point-objects
in k-motion, where k is a fixed nonnegative integer, such that, in
steady-state, S is the set of distinct extreme points of a convex
polygon C. A rectangle of minimum area enclosing all the points of C
may be determined by a mesh of size $4^{\lceil \log_4 n \rceil}$ in $\Theta(n^{1/2})$ time; by a
hypercube of size $2^{\lceil \log_2 n \rceil}$ in $O(\log^2 n)$ time; and by a hypercube of
size $2^{\lceil \log_2 n \rceil}$ in expected O(log n) time. □


Corollary 5.7 Let k be a fixed nonnegative integer. Let S be a system
of n point-objects in k-motion. Then a description of a steady-state
minimal-area rectangle enclosing all the points of S can be given by a
mesh of size $4^{\lceil \log_4 n \rceil}$ in $O(n^{1/2})$ time; by a hypercube of size $2^{\lceil \log_2 n \rceil}$
in $O(\log^2 n)$ time; and by a hypercube of size $2^{\lceil \log_2 n \rceil}$ in expected
O(log n) time. □
ABSTRACT


Many multiprocessor systems based on the
hypercube or binary n-cube topology have been built
recently. In such systems, I/O processors are used to
handle the data transfer between the processors and the 
outside world or the host. In this paper, we propose a 
method of embedding the I/O processors in such a 
system. The proposed method is shown to require far 
fewer links than the earlier methods. It is also shown 
that the new method achieves a higher I/O adjacency, 
and, as a result, a higher degree of tolerance of I/O failures.
Necessary and sufficient conditions are derived to obtain 
such an embedding. A generalization of the problem to 
k-regular networks is presented and necessary conditions 
are derived for I/O embedding in such a network. It is 
shown that embedding in a general k-regular network is 
NP-complete. An algorithm is presented for finding a 
minimal embedding in a k-regular network. 


1. INTRODUCTION 


A binary n-cube consists of $2^n$ processors interconnected
in an n-dimensional binary cube topology. Each
processor in such a system has its own memory and the
processors communicate by message passing. A processor
in an n-cube can be represented by an n-bit string,
$p_1 p_2 \cdots p_n$. Each processor is adjacent to one processor
along each of the n dimensions. Specifically, a processor
$(p_1 p_2 \cdots p_n)$ is adjacent to $(\bar{p}_1 p_2 \cdots p_n)$, $(p_1 \bar{p}_2 \cdots p_n)$,
..., $(p_1 p_2 \cdots \bar{p}_n)$. Binary cubes are known to have several
useful properties, namely a high degree of connectivity,
fault tolerance, low diameter, etc. Several problems, such
as sorting and the FFT, are known to map well onto
hypercubes. Other interconnections, such as the linear array
and the mesh, are known to map easily onto the hypercube
interconnection. Several hypercube systems have been built recently
[1,2,3,4]. One issue that has not been addressed
effectively in the past is the support of efficient I/O
operations in multiprocessors. The importance of balancing
I/O bandwidth and computational power has been
pointed out by Kung [5].
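The adjacency rule stated in this paragraph amounts to flipping one bit of the node label. A minimal sketch, assuming node labels are stored as integers in the range 0 to 2^n - 1 (the function names are illustrative and do not come from the paper):

```python
def neighbors(node: int, n: int) -> list[int]:
    """Return the n neighbors of `node` in a binary n-cube.

    Assumption: the label is an integer whose n low-order bits are the
    label bits; flipping bit i moves along dimension i.
    """
    return [node ^ (1 << i) for i in range(n)]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the labels a and b differ."""
    return bin(a ^ b).count("1")
```

For instance, `neighbors(0, 3)` lists nodes 1, 2 and 4, each at Hamming distance one from node 0.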


° This research was supported in part by the National Science Foundation 
Presidential Young Investigator Award under Grant NSF MIP 86-57563 PYI, 
and in part by the Semiconductor Research Corporation under Contract SRC 
86-12-109. 




Santosh G. Abraham 


Dept. of Elec. Engg. & Comp. Sci.
University of Michigan 
1301 Beal Ave. 
Ann Arbor, MI 48109 


I/O processors are used for transferring data 
between the hypercube nodes and the outside world and 
the host. This is to be distinguished from the I/O 
hardware that is required for communication between 
the processors. I/O communication is required to distri- 
bute data and code to the processors before the computa- 
tion and to receive the results after the computation has 
been completed. Each processor in the system is con- 
nected to an I/O processor and the I/O processor handles 
all the data transfer between that processor and the out- 
side world. In this paper, we propose a method to connect 
the I/O processors and processors together to build a sys- 
tem with a higher degree of I/O adjacency and a higher 
degree of fault tolerance. The Intel iPSC system uses I/O 
hardware within each processor for I/O communication 
using the Ethernet protocol [2]. In the NCUBE system, an
I/O processor is connected to a subcube of 8 processors 
and the I/O processors are themselves interconnected par- 
tially [1]. Our method uses the system links efficiently 
for both the I/O and processor-to-processor communica-
tion. The proposed method requires no explicit
processor-to-I/O processor links. It is also shown that our 
method achieves a higher processor-to-I/O processor adja- 
cency and as a result higher tolerance of I/O failures. In 
a related recent work, the hypernet architecture has been 
proposed [6] for maintaining a constant node degree. The 
hypernet architecture provides a set of nodes explicitly 
meant for performing I/O operations in a concurrent 
manner. However, that work on I/O embedding is appli- 
cable only to the hypernet topology and the I/O embed- 
ding itself has not been evaluated. Our method can be 
directly used in the hypercubes with minor modifications 
to the architecture. 


Embedding general graphs onto the hypercube topol- 
ogy is known to be an NP-complete problem [7]. Embed- 
ding arbitrary meshes onto hypercubes is reported in [8]. 
Several numerical computations have been mapped onto 
the hypercube topology [9, 10]. 


The new method of connecting I/O processors and 
processors is presented in Section 2. Necessary and 
sufficient conditions are derived for embedding I/O pro- 
cessors in such a manner. Section 3 looks at the advan- 
tages and disadvantages of the proposed method. In Sec- 
tion 4, we consider a generalization of the proposed 
method to a k-regular interconnection network and 
derive some necessary conditions. 


2. I/O EMBEDDING


DEFINITION 1: I/O embedding is the problem of 
mapping I/O processors onto a multiprocessor system 
such that each processor in the system is adjacent to at 
least one I/O processor. 


Consider embedding I/O processors in a k-cube mul- 
tiprocessor. The I/O processors are embedded in the sys- 
tem along with the processors. The system will then con- 
sist of two types of nodes 1) processor nodes, P-nodes 
and 2) I/O processor and processor nodes, I-nodes. The 
processor and I/O processor share the links to the I-node 
through a switch as shown in Fig.1. The switch can connect
any of the system links to the processor or I/O processor
and the processor to the I/O processor. The implications
of sharing the links by the processor and the I/O
processor are discussed in Section 3. But we briefly note 
that in a model of computation where each node receives 
the data it has to operate on and after carrying out the 
computation, sends the results back to the host, I/O com- 
munication and interprocessor communication do not 
overlap. Hence, sharing links does not result in conges-
tion. Since each I-node also contains a processor, the cube


Fig.1. Architecture of the nodes. [Figure not reproduced; surviving labels: I/O proc., I-node, P-node.]


topology of the multiprocessor is not disturbed and hence 
the proposed architecture will retain all the desirable 
properties of a binary cube such as ease of problem map- 
ping. Each node in a k-cube has k neighbors. Hence each 
I-node is adjacent to k nodes. So an I/O processor in an 
I-node can serve as an I/O processor to its k neighbors 
and the processor in its node. A processor sends data
along the system links with an appropriate tag to indicate 
whether the data is meant for the processor or the I/O 
processor at that node. At an I-node, the data is appropri- 
ately switched to the processor or the I/O processor 
depending on the tag. By embedding enough I-nodes 
throughout the hypercube, we can make sure that each
processor is adjacent to at least one I/O processor. The
advantage of embedding the I/O processors in such a way
is that we do not require explicit processor-to-I/O
processor links. As a result, we can construct a larger size
hypercube given the same number of links for processor 
communication. For example, in the Intel iPSC, there are 
eight channels at each processing node, seven of which are 
used to connect 128 processors, the eighth to connect 
every node to the host. Using our scheme, we can connect 
a 256 processor cube using the same number of channels. 


Each node in the cube has a register associated with it 
which indicates along which dimension it is adjacent to 
an I/O processor. Whenever the processor has to transfer 
I/O data, the processor sends the data along this dimen- 
sion with an appropriate tag to indicate whether it is I/O 
data or a message to the processor at that node. The 
hypernet architecture [6] also uses two types of nodes, 
but the I/O nodes in their architecture are only used for 
I/O communication and the architecture is concerned 
with maintaining a constant node degree. 
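The per-node register described above can be filled in once the I-node locations are known. The following is a hedged sketch: the function name and the convention that an I-node stores `None` (because its I/O processor is local) are assumptions made for illustration, since the paper does not say what the register of an I-node holds.

```python
from typing import Optional


def io_dimension_registers(k: int, i_nodes: set[int]) -> dict[int, Optional[int]]:
    """Map each node of a k-cube to a dimension along which an I-node lies.

    Assumption of this sketch: I-nodes map to None because their own
    I/O processor is reached through the local switch.
    """
    registers: dict[int, Optional[int]] = {}
    for node in range(2 ** k):
        if node in i_nodes:
            registers[node] = None
            continue
        dims = [d for d in range(k) if (node ^ (1 << d)) in i_nodes]
        if not dims:
            raise ValueError(f"node {node} has no adjacent I-node")
        registers[node] = dims[0]  # any adjacent I-node would do; take the lowest dimension
    return registers


# The 3-cube with I-nodes at 0 and 7, the example used in the text.
example_registers = io_dimension_registers(3, {0, 7})
```

A node would then send I/O data along `registers[node]`, tagging it as I/O data so that the switch at the receiving I-node routes it to the I/O processor.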


DEFINITION 2: A network is said to have a
perfect embedding if it is possible to embed the I-nodes in
the network such that each processor in the system is
adjacent to exactly one I/O processor.


In the remainder of the section we will characterize
when a k-cube can have a perfect embedding.


LEMMA 1: For a k-cube to have a perfect embedding,
$k = 2^L - 1$, for some integer L.


PROOF: Each I-node in the system is adjacent to k
processors along the k dimensions of the cube. Moreover,
the I/O processor in an I-node also serves the processor in
its node. Hence, each I/O processor serves (k+1)
processors in the cube. Then, if a perfect embedding exists,
(k+1) should divide the number of processors $n = 2^k$ in
the system. Since the number of processors in a hypercube
system is always a power of 2, the only factors of n
are smaller powers of 2. This implies that $(k+1) = 2^L$,
for some L. □
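The counting argument in this proof is easy to check numerically. The snippet below is a small sanity check rather than part of the paper: it lists the cube sizes k up to 32 for which (k+1) divides 2^k, using an illustrative helper name.

```python
def passes_counting_condition(k: int) -> bool:
    """True when (k + 1) divides 2**k, the divisibility used in the proof."""
    return (2 ** k) % (k + 1) == 0


candidate_sizes = [k for k in range(1, 33) if passes_counting_condition(k)]
```

The surviving sizes are exactly those with k + 1 a power of two. The trivial 1-cube (L = 1) also passes this arithmetic test, although the text that follows starts its list at k = 3.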


From Lemma 1, for a perfect embedding to exist, k =
3, 7, 15,.... A perfect embedding for a 3-cube is shown in
Fig.2. A perfect embedding for a 7-cube, with processors
numbered from 0 to 127, can be obtained by locating the
I-nodes at 0, 7, 25, 30, 42, 45, 51, 52, 75, 76, 82, 85, 97,
102, 120, 127. The next question one would like to ask
is: can we always find a perfect embedding in a k-cube
with $k = 2^L - 1$? In other words, is this condition sufficient?
The answer to this question is yes and this leads us to the
following lemma:


LEMMA 2: A perfect embedding exists in a k-cube
when $k = 2^L - 1$.


PROOF: Consider two I-nodes a and b in a perfect
embedding. Then nodes a and b cannot be adjacent. If
they are adjacent, then the processors in nodes a and b are
each adjacent to two I/O processors. By a similar
argument, the neighbors of a cannot be neighbors of b. This
implies that any two I-nodes in a perfect embedding have
to be at a Hamming distance of 3 or greater.


The result follows from the theory of single error
correcting Hamming codes. Each code word is at a Hamming
distance of 3 or greater from every other code word. Hamming
codes are known to be perfect codes [11], i.e., they attain
the upper bound on the number of distance-3 code words that
can be found in a given space. With $k = 2^L - 1$ bits, the number of
distance-3 code words is bounded by $\frac{2^k}{k+1}$ and Hamming
codes achieve this bound. In such a code space, each
non-code word is adjacent to exactly one code word.
Now the hypercube can be seen as a k-dimensional space
and we can place I-nodes in the code word locations. Each
non-code word can be a P-node and we have the required
perfect embedding in the k-cube when $k = 2^L - 1$. This
completes the proof. □
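The construction in this proof can be made concrete with a standard Hamming parity check. The sketch below is one possible realization, not necessarily the labelling the authors used: bit i of a node label is given the check value i + 1, and a label is taken as a code word when the XOR of the check values of its set bits is zero. Both function names are illustrative.

```python
from functools import reduce
from operator import xor


def hamming_i_nodes(k: int) -> list[int]:
    """I-node locations in a k-cube (k = 2**L - 1) taken as Hamming code words."""
    if ((k + 1) & k) != 0:
        raise ValueError("k must be of the form 2**L - 1")

    def syndrome(x: int) -> int:
        # XOR of the check values (i + 1) of all bit positions set in x.
        return reduce(xor, (i + 1 for i in range(k) if (x >> i) & 1), 0)

    return [x for x in range(2 ** k) if syndrome(x) == 0]


def is_perfect_embedding(k: int, i_nodes: list[int]) -> bool:
    """Check that every node sees exactly one I/O processor.

    A node 'sees' its own I/O processor if it is an I-node, plus the
    I/O processors of its k neighbors.
    """
    marked = set(i_nodes)
    for x in range(2 ** k):
        seen = (x in marked) + sum((x ^ (1 << d)) in marked for d in range(k))
        if seen != 1:
            return False
    return True
```

With this bit ordering the 3-cube yields I-nodes 0 and 7, and the 7-cube yields 16 I-nodes, i.e. $2^7/(7+1)$, matching the bound quoted in the proof.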


THEOREM 1: In a k-cube, $k = 2^L - 1$ is a necessary and
sufficient condition to find a perfect embedding.


PROOF: From Lemmas 1 and 2. □


Fig.2. A perfect embedding in a 3-cube. 


The constructive proof of Lemma 2 gives us a way
of finding a perfect embedding in a k-cube. One could also
use a method of sieves [12] to find the locations of the
I-nodes. In such a method, we choose a node in the hypercube
space and eliminate (sieve out) all its neighbors and
distance-2 neighbors from the list. Then pick another
node from the list and continue this process till the list is
empty. This method gives an embedding similar to the one
obtained by the Hamming codes approach when a perfect
embedding exists. Moreover, this method gives an I/O
embedding for any k-cube, even when $k \neq 2^L - 1$; the
obtained embedding is perfect only if $k = 2^L - 1$.
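The sieve can be sketched in a few lines. The version below makes one assumption the text leaves open: the node picked at each step is always the lowest-numbered node still on the list. The helper `io_adjacency` and the constant name are illustrative.

```python
# I-node list quoted in the text for the 7-cube.
PAPER_7_CUBE_I_NODES = [0, 7, 25, 30, 42, 45, 51, 52,
                        75, 76, 82, 85, 97, 102, 120, 127]


def sieve_i_nodes(k: int) -> list[int]:
    """Greedy sieve: pick the lowest remaining node, then drop every node
    within Hamming distance 2 of it (the node itself, its neighbors and
    its distance-2 neighbors)."""
    remaining = list(range(2 ** k))
    chosen = []
    while remaining:
        pick = remaining[0]
        chosen.append(pick)
        remaining = [x for x in remaining if bin(x ^ pick).count("1") > 2]
    return chosen


def io_adjacency(k: int, i_nodes: list[int]) -> list[int]:
    """For each node, count the I/O processors it can reach: its own
    (if it is an I-node) plus those of adjacent I-nodes."""
    marked = set(i_nodes)
    return [
        (x in marked) + sum((x ^ (1 << d)) in marked for d in range(k))
        for x in range(2 ** k)
    ]
```

With this picking order the sieve reproduces the 7-cube list given earlier. For sizes not of the form $2^L - 1$ it seems advisable to run the coverage check as well: for k = 4, for example, this particular order selects only nodes 0 and 7 and leaves some nodes (node 9 among them) without an adjacent I-node.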


By shifting each I-node location in a fixed direction, 
we can obtain another perfect embedding of a given cube. 
For example, 0 and 7 is a perfect embedding in a 3-cube. 
If we shift the I-node along 1-dimension to location 1, by 
suitably shifting 7 to location 6, we have another embed- 
ding with 1 and 6. We can similarly shift the perfect 
embedding 1 and 6 to another perfect embedding such as 
3 and 4. Hence, we can find a perfect embedding that con- 
tains a given node of a hypercube. 
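Shifting every I-node in a fixed direction corresponds to XOR-ing all I-node labels with the same offset, which preserves Hamming distances and therefore keeps the embedding perfect. A minimal sketch, with illustrative function names:

```python
def shift_embedding(i_nodes: list[int], offset: int) -> list[int]:
    """Translate every I-node label by XOR with `offset`."""
    return sorted(x ^ offset for x in i_nodes)


def embedding_containing(base: list[int], node: int) -> list[int]:
    """Shift the perfect embedding `base` so that `node` becomes an I-node.

    Assumes `base` is non-empty; its first I-node is the one moved onto `node`.
    """
    return shift_embedding(base, base[0] ^ node)
```

Starting from the 3-cube embedding (0, 7), an offset of 1 gives (1, 6), and a further offset of 2 gives (3, 4), as in the example above.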


DEFINITION 3: If we specify that a particular node i
has to be in the I/O embedding, we call such an
embedding an i-specified embedding.
If we get a perfect embedding only for a few sizes of
the cube, how do we embed the I/O processors in cubes of 
other sizes? For example, consider a 4-cube. We can view 
the 4-cube as a union of two 3-cubes. Use perfect embed- 
dings in each subcube to obtain an embedding in the 4- 
cube. Such an embedding for a 4-cube is shown in Fig.3. 
We notice from the figure that some of the nodes 
(2,5,8,15) in the 4-cube are adjacent to two I/O proces- 
sors. Though we still have the same ratio of I/O proces- 
sors (1 for every 4 processors), we obtained a higher I/O 
adjacency for some of the processors. This is due to the 
fourth dimension links in the 4-cube. Similarly, if we 
embed processors in a 5-cube, more processors will have 
an I/O adjacency of two. The higher I/O adjacency can be 
useful in several ways: increasing the tolerance of
I/O failures, decreasing the possibility of congestion
at an I-node, etc. We can carry out this procedure for any
size cube to obtain an I/O embedding. This leads us to
the question: can we embed the I/O processors systemat- 
ically to get an I/O adjacency of two for each processor? 
The answer to this question is yes and we give an algo- 
rithm to obtain an I/O adjacency of two. 
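The 4-cube example can be checked by counting, for each node, the I/O processors it can reach. Fig.3 is not available here, so the I-node placement below is an assumption chosen to be consistent with the nodes named in the text: the lower 3-cube uses (0, 7) and the upper 3-cube uses the shifted embedding (2, 5), i.e. nodes 10 and 13.

```python
def io_adjacency(k: int, i_nodes: list[int]) -> dict[int, int]:
    """Count, per node, its own I/O processor (if any) plus adjacent I-nodes."""
    marked = set(i_nodes)
    return {
        x: (x in marked) + sum((x ^ (1 << d)) in marked for d in range(k))
        for x in range(2 ** k)
    }


# Assumed placement (not read off Fig.3): lower subcube (0, 7), upper subcube 8 + (2, 5).
I_NODES_4_CUBE = [0, 7, 10, 13]
adjacency = io_adjacency(4, I_NODES_4_CUBE)
doubly_served = sorted(
    x for x, count in adjacency.items() if count == 2 and x not in I_NODES_4_CUBE
)
```

Under this placement the processors with an I/O adjacency of two are exactly nodes 2, 5, 8 and 15, and every other node still reaches one I/O processor.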


Fig. 3. An embedding in a 4-cube. 


Consider I/O embedding in a (2k+1)-cube, where $k = 2^L - 1$.
A (2k+1)-cube consists of $2^{k+1}$ k-cubes. Let these
subcubes be represented by $S_0, S_1, \ldots, S_{2^{k+1}-1}$. When
$k = 2^L - 1$, we can obtain a perfect embedding of a k-cube. Let
the nodes in each subcube be numbered $0, 1, 2, \ldots, 2^k - 1$, with
node i in subcube $S_j$ being the node numbered $i + j \cdot 2^k$ in
the (2k+1)-cube. Each subcube of size k can be embedded
perfectly by Lemmas 1 and 2. To obtain an embedding in
the (2k+1)-cube with an I/O adjacency of two, get an
i-specified embedding of subcubes $S_i$ and $S_{i+2^k}$, for
$i = 0, 1, \ldots, 2^k - 1$. Within each subcube, each node is adjacent
to exactly one I-node. Now consider the nodes within $S_0$,
numbered $0, 1, \ldots, 2^k - 1$. Node 1 is adjacent
to an I-node in the subcube $S_1$, since it is 1-specified.
Similarly each P-node in the subcube $S_0$ is adjacent to
one I-node within the subcube and one I-node in some
other subcube. Since the cube is symmetric, each subcube
has a similar property and hence each P-node in the
(2k+1)-cube is adjacent to two I-nodes. Since we have
similar embeddings in cubes $S_i$ and $S_{i+2^k}$, each I-node is
adjacent to another I-node. This completes the proof that
the given algorithm generates an embedding with an I/O
adjacency of two. For each node to have an I/O adjacency
of two the size of the cube also has to be of the form
$2^L - 1$. For example, in a 7-cube, each subcube of size 3
has I-nodes along a diagonal and this diagonal is oriented
in different directions in different 3-cubes to get a
symmetrical embedding with an I/O adjacency of two. A
0-specified embedding of $S_0$ gives (0,7), a 1-specified
embedding of $S_1$ gives (9,14) and so on. By continuing a
detailed enumeration, we obtain the following I/O
embedding for the 7-cube with an I/O adjacency of two:
(0,7), (9,14), (18,21), (27,28), (36,35), (45,42), (54,49),
(63,56) and (64,71), (73,78), (82,85), (91,92), (100,99),
(109,106), (118,113), (127,120). We can carry this idea
further to get embeddings with higher I/O adjacency.
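The enumeration for the 7-cube can be reproduced mechanically. The sketch below assumes that the i-specified embedding of a subcube is obtained by XOR-shifting a base perfect embedding so that its first I-node lands on node i; the function names are illustrative.

```python
def adjacency_two_embedding(k: int, base: list[int]) -> list[int]:
    """I-nodes of the (2k+1)-cube built from 2**(k+1) subcubes of size k.

    Subcube S_j holds the nodes i + j * 2**k and receives the
    (j mod 2**k)-specified shift of the perfect embedding `base`.
    """
    size = 2 ** k
    i_nodes = []
    for j in range(2 * size):
        offset = base[0] ^ (j % size)  # moves base[0] onto node (j mod 2**k)
        i_nodes.extend(j * size + (x ^ offset) for x in base)
    return i_nodes


def io_adjacency(k: int, i_nodes: list[int]) -> list[int]:
    """Per node: its own I/O processor (if any) plus adjacent I-nodes."""
    marked = set(i_nodes)
    return [
        (x in marked) + sum((x ^ (1 << d)) in marked for d in range(k))
        for x in range(2 ** k)
    ]


seven_cube_i_nodes = adjacency_two_embedding(3, [0, 7])
```

For k = 3 this yields the pairs listed in the text, in the same order, and `io_adjacency(7, seven_cube_i_nodes)` reports two reachable I/O processors for every node (an I-node counts its own I/O processor plus the one adjacent I-node).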


Another simpler construction exists which gives an 
I/O adjacency of two. Consider a k-cube with $k = 2^L - 1$.
Find a perfect embedding of the k-cube. From Lemmas 1 
and 2, such an embedding exists and we can find it by the 
method of sieves or the Hamming code method. Each 
node in the cube is adjacent to one of the I-nodes now. 
Obtain a new embedding by shifting one of the I-nodes. 
Each processor is again adjacent to one I-node in the 
second embedding as well. Hence, by overlapping two 
embeddings we can obtain an embedding where each node 
in the cube is adjacent to two I-nodes in the cube. 
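This overlapping construction can also be checked by direct counting. The sketch assumes the second embedding is the first one shifted along a single dimension (XOR with 1), and it uses the 7-cube I-node list quoted earlier in this section as the base embedding.

```python
BASE_7_CUBE = [0, 7, 25, 30, 42, 45, 51, 52, 75, 76, 82, 85, 97, 102, 120, 127]


def overlapped_embedding(base: list[int], offset: int = 1) -> list[int]:
    """Union of a perfect embedding and its XOR-shifted copy."""
    return sorted(set(base) | {x ^ offset for x in base})


def io_adjacency(k: int, i_nodes: list[int]) -> list[int]:
    """Per node: its own I/O processor (if any) plus adjacent I-nodes."""
    marked = set(i_nodes)
    return [
        (x in marked) + sum((x ^ (1 << d)) in marked for d in range(k))
        for x in range(2 ** k)
    ]


overlap_7_cube = overlapped_embedding(BASE_7_CUBE)
```

The result has 32 I-nodes, one for every four processors, and every node reaches exactly two I/O processors, but the I-node locations differ from those of the subcube-based construction.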


It is to be noted that the two methods described 
above obtain different embeddings, both with an I/O 
adjacency of two. The difference in the two methods of 
construction is that the first method starts building from 
a subcube, whereas the second method treats the cube as 
a whole. The first method is well suited for expanding an 
existing system. It is also to be noted that with one I/O 
processor for every four processors, a 3-cube has an I/O 
adjacency of one and a 7-cube has an I/O adjacency of 
two. If one interconnects I/O processors and processors in 
the way it is done in NCUBE, the adjacency remains
the same even in higher dimensional cubes. 


3. PRACTICAL CONSIDERATIONS 


By placing the I/O processor along with the proces-
sor in a node, we do not require explicit I/O processor-
to-processor links. The links thus saved can be utilized in
building a larger size cube. For a given number of links 
per processor, the proposed method enables connecting a 
larger number of nodes together. As we observed, because 
of utilizing the system links, as we go to higher dimen- 
sions we can achieve a higher I/O adjacency. Higher adja- 
cency implies higher tolerance of I/O failures. Each I/O 
processor now needs links only for interconnecting with 
other I/O processors. Since the I/O processors are
connected to the processors by the system links, we
would need fewer links per I/O processor compared to 
other designs. The links on an I/O processor are only 
needed for inter-I/O processor communication. In other 
words, with the same number of links per I/O processor, 
we can achieve a higher connectivity between the I/O
processors. Since each processor is adjacent to at least one
I/O processor, an I/O operation would involve a single 
message transfer. By appropriately interconnecting the 
I/O processors, we can reduce the inter I/O processor 
traffic. This design can be seen as an integrated system 
design consisting of the processors and the I/O processors. 
This is to be contrasted with designing the multiprocessor 
system separately and then connecting some I/O proces- 
sors to it. 


Utilizing the system links for I/O transfer requires 
some consideration. We might create congestion along the 
links when I/O and interprocessor communication have 
to take place along the same link at the same time. There 
are two reasons to believe that sharing the links for I/O 
and interprocessor communication does not lead to 
congestion or a bottleneck. Most problems are solved on a 
multiprocessor system in the following manner: (1) dis-
tribute the data and code to each processor, (2) carry out
the computation in a cooperative manner and (3) combine 
the results together. Steps 1 and 3 are I/O communication 
and Step 2 requires computation and interprocessor com- 
munication. Within such a model of solving a problem, 
we can see that the I/O communication and interproces- 
sor communication do not overlap in time. And this leads 
us to conclude that the system links can be efficiently 
shared for both I/O communication and interprocessor 
communication. Besides, the I/O bandwidth required for 
most problems is about an order of magnitude smaller 
than that of interprocessor communication. Hence, even 
when I/O communication may overlap with interproces- 
sor communication, this may not lead to a severe strain 
on the resources. Another solution can be put forward 
for this problem. Give interprocessor communication
priority over I/O communication such that the computa- 
tion may not be delayed even in the presence of I/O com- 
munication. When the computation is completed, all the 
processors need to send the data to an I/O processor. Since 
each processor uses a distinct link for I/O communica- 
tion, the links will not be congested. However, the I/O 
processor may not be able to receive all the computed 
values from all its neighbors at the same time. This 
bottleneck is unavoidable and exists in other designs as
well [1,2].


A perfect embedding of a 3-cube gives one I/O pro- 
cessor for every four processors in the system and simi- 
larly a perfect embedding of a 7-cube gives one I/O pro- 
cessor for every eight processors. Higher ratios can be 
easily obtained by superimposing two or more perfect 
embeddings as explained in Section 2. The ratio of I/O 
processors to processors in other size cubes depends on 
the subcube size (3 or 7) chosen for embedding such a 
cube. Smaller ratios than 1 out of 4 cannot be obtained 
for cubes of size smaller than 7.


Since each subcube of size 3 or 7, depending on the 
system size, looks like any other subcube, the design can 
be made uniform by designing one board for a subcube. 
The boards can be connected together to form higher size 
cubes. The required design effort may be more than that 
of the other designs because of the non-uniformity 
among the nodes. The switch hardware needed in an I- 
node is an overhead for this organization. 


A comparison of different I/O embeddings in a k- 
cube is made in Table 1. In the NCUBE design, there 
exists one I/O processor for every eight nodes. The 
number of links in a k-cube is $k 2^{k-1}$, which is also the
number of links in the proposed method. In the NCUBE and
iPSC designs, there is an extra link per processor for I/O
and hence the extra $2^k$ links. The iPSC system uses a bus
to connect all the I/O processors and hence its I/O 
bandwidth is a constant, whereas the NCUBE design and 
the proposed method enable connecting multiple I/O dev- 
ices to each I/O processor and hence the bandwidth is 
equal to the number of I/O processors. It is seen from the 
table that the proposed method achieves same or higher 
bandwidth with fewer links. This is achieved by efficient 
utilization of the links. 
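The link counts quoted in this comparison follow directly from the two formulas. A small sketch with illustrative function names:

```python
def links_proposed(k: int) -> int:
    """Links in a k-cube, which the proposed method uses for both kinds of traffic."""
    return k * 2 ** (k - 1)


def links_with_dedicated_io(k: int) -> int:
    """Links when every processor also has one extra link for I/O."""
    return k * 2 ** (k - 1) + 2 ** k
```

For a 7-cube this gives 448 links for the proposed method against 576 with one dedicated I/O link per processor.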


Table 1. Comparison of Different I/O embeddings

[Table body not recoverable from the source; the only surviving column heading is "# of I/O proc."]


4. POSSIBLE PERFORMANCE
IMPROVEMENTS


In this section, we will consider how the proposed
I/O embedding may affect the performance of the system.
The performance improvements depend on whether the
problem is I/O bound or computation bound. Clearly, the
performance improvements will be more pronounced in
an I/O bound problem. We consider an I/O bound problem,
matrix-vector multiplication, to show the possible
performance improvements by using concurrent I/O.


Consider multiplying a matrix $A_{n \times n}$ by a vector $B_n$
to generate $C_n$. Algorithm mapping depends on the
number of processors in the system, the size of available
memory at each node and the size of the problem.
Assume that the size of the matrix and the system are
such that p rows of A and the vector B are mapped to
each node. Assume that the maximum message size is
such that we can send at most k rows as a single message.
Then the total number of messages sent by the host is
given by $l = \lceil p/k \rceil \cdot m$, where m is the number of processors
in the system. The number p is a function of various
parameters (available memory size at each node, number
of processors in the system, size of the problem, etc.) and
has to be chosen carefully to minimize the execution time
for the problem. If memory at the nodes is not a problem,
then we can make p = n/m. Then, the data distribution
time is given by O(l). The computation time is given by
$O(n^2/m)$. Once the computation is complete, in a single
host system, the data needs to be collected at one site and
this again incurs an O(l) cost. If m = n, then l = n. If the
relative cost of a message transmission is $a$ compared to a
unit of computation (an addition and a multiplication
here), then the total cost of the algorithm is given by
$n + a(2n)$. When we use an I/O embedding as described in
the paper, the second term in the above cost can be
reduced significantly. Since we have m/4 = n/4 I/O devices,
the effective cost of the algorithm will be $n + a(2 \cdot 4)$.
This is based on the assumption that the data was
initially distributed among the I/O devices such that all
the I/O devices can distribute data in parallel to their
neighboring processors. The speedup obtained by concurrent
I/O is then given by $\frac{n + a(2n)}{n + a(2 \cdot 4)}$. With n = 128, a =
8, we get a speedup of 17 * 128 / 192 = 11.33. This
speedup factor is an optimistic measure of improvement
because of concurrent I/O. However, this measure does
reflect the importance of using concurrent I/O.
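The cost model of this example can be written out directly. The sketch below only restates the formulas from the text; the function and parameter names are illustrative, and `procs_per_io = 4` encodes the assumption of one I/O device for every four processors.

```python
from math import ceil


def host_messages(p: int, rows_per_message: int, m: int) -> int:
    """Total messages sent by a single host: ceil(p / k) per node, for m nodes."""
    return ceil(p / rows_per_message) * m


def cost_single_host(n: int, a: float) -> float:
    """n units of computation plus distribution and collection of n messages each."""
    return n + a * (2 * n)


def cost_concurrent_io(n: int, a: float, procs_per_io: int = 4) -> float:
    """Each I/O device distributes to and collects from its own processors in parallel."""
    return n + a * (2 * procs_per_io)


def speedup(n: int, a: float) -> float:
    return cost_single_host(n, a) / cost_concurrent_io(n, a)
```

With n = 128 and a = 8 the two costs are 2176 and 192, which gives the speedup of about 11.33 quoted above.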


It is to be noted that, in the above calculation, we 
assumed that the data is initially stored on the I-nodes in 
the desired way. This raises the important question of 
data organization in such a system, with possibly a disk 
at each I-node. The pattern of data access during the 
algorithm dictates the way the data is to be stored on the 
disks. The algorithm, in turn, depends on the way the 
data is organized in the system. Hence, the algorithm 
design and data distribution problems need to be con- 
sidered together. Sometimes, the data needs to be organ- 
ized in different ways during different phases of a com- 
putation. It is possible to build a system where the I/O 
nodes can carry out data reorganization in parallel while 
the processors carry out the computation. 


An embedding in a 4-cube that enables parallel I/O 
operations is shown in Fig.8. We add an extra link 
between the two I-nodes in a 3-cube. Thus the two I/O 
processors in a 3-cube are directly connected and can 
carry out any data transfer between the two I-nodes, 
while the nodes can communicate in parallel over the 
system links. In Section 2, we aimed at obtaining an
embedding that utilized the minimum number of I/O nodes
such that each processor is adjacent to at least one I/O
processor. As a result, the I/O processors are required to
be at a Hamming distance of three or greater from one
another. With an extra link between the two I-nodes in a 


Fig.8. An embedding in a 4-cube 


3-cube as shown in Fig.8, the I-nodes can communicate in 
an efficient manner. The I-nodes in a 3-cube are seen to be 
connected in a 1-cube fashion. When two such 3-cubes 
are connected together to form a 4-cube, the I-nodes are 
connected in a 2-cube. In general, if an n-cube is built 
out of the basic 3-cube, shown in Fig.8, the I-nodes are 
connected as a (n-2)-cube. The advantage of such a 
configuration is that any algorithms developed for 
efficient data movement over the cube can be used for
data reorganization among the I-nodes since they are now 
connected in the form of a cube themselves. The data 
reorganization issues are discussed in greater detail else-
where [13]. If the basic 3-cube is initially balanced with
respect to I/O and computation power, any resulting
larger configuration is seen to have the same desirable
property.
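The claim that the I-nodes of an n-cube form an (n-2)-cube can be verified with a short script. The sketch assumes that every 3-subcube uses the same I-node pair (0, 7) and that the added link joins the two I-nodes of each 3-subcube; the relabelling function is hypothetical and serves only to exhibit the cube structure.

```python
def i_node_links(n: int):
    """I-nodes of an n-cube tiled by 3-cubes with I-nodes (0, 7), and the links among them.

    Links are the system links that happen to join two I-nodes, plus the
    extra link added inside each 3-cube.
    """
    i_nodes = [8 * j + x for j in range(2 ** (n - 3)) for x in (0, 7)]
    marked = set(i_nodes)
    links = set()
    for u in i_nodes:
        for d in range(n):
            v = u ^ (1 << d)
            if v in marked:
                links.add(frozenset((u, v)))
        links.add(frozenset((u, u ^ 7)))  # extra link to the other I-node of the same 3-cube
    return i_nodes, links


def relabel(u: int) -> int:
    """Map an I-node to an (n-2)-bit label: subcube index bits, then one bit for 0 versus 7."""
    return ((u >> 3) << 1) | (u & 1)
```

Two I-nodes are linked exactly when their relabelled addresses differ in one bit, so for n = 5 the eight I-nodes and their twelve links form a 3-cube.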


5. GENERALIZATION TO REGULAR
INTERCONNECTION NETWORKS


In this section, we consider generalizing the I/O 
embedding idea to a general interconnection network. An 
interconnection network is normally regular and we can 
represent such a network by a k-regular graph. A k- 
regular graph is a graph in which each node has a degree 
k. Let the number of nodes in such a graph be n. It is to
be noted that $n \neq 2^k$ in a general k-regular graph and no
such simple relation exists between n and k. What are the
necessary conditions for a given k-regular graph to have a 
perfect embedding? 


LEMMA 3: For a given k-regular graph on n nodes to
have a perfect embedding, (k+1) should divide n and
$\frac{nk(k-1)}{(k+1)}$ has to be even.

PROOF: Each I-node in the graph serves as an I/O
processor for (k+1) nodes. If the graph has a perfect
embedding, each node is adjacent to exactly one I-node
and this implies that (k+1) should divide n. Now consider
a perfect embedding in a given graph. Since each
P-node is adjacent to exactly one I-node, and the graph is
k-regular, the partition containing only P-nodes has to be
(k-1)-regular. There are $n - \frac{n}{k+1} = \frac{nk}{k+1}$ nodes in
this partition. Now counting the degrees on each node, we
require $\frac{nk(k-1)}{(k+1)}$ to be even, since it is equal to twice the
number of edges within this partition. □


For example, consider the cube-connected cycles [14]
in 3-dimensions. The network has 3*8 = 24 nodes and
each node has degree 3, i.e., n = 24 and k = 3. The
numbers n and k satisfy the given necessary conditions
and a perfect embedding for this network is shown in
Fig.4. Are these necessary conditions sufficient? No. We
give a simple counterexample in Fig.5, where n = 12 and
k = 3. Though this graph satisfies the necessary condi-
tions, no perfect embedding exists.

Fig.4. A perfect embedding in a
cube-connected cycles network.


Fig.5. A counterexample that satisfies the necessary conditions.
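The two necessary conditions of Lemma 3 are easy to check mechanically. The following is a minimal sketch in Python, not part of the original paper; the function name `satisfies_necessary_conditions` is hypothetical. It tests only the counting conditions on n and k, so a result of True does not imply that a perfect embedding exists, as the counterexample with n = 12 and k = 3 illustrates.

```python
def satisfies_necessary_conditions(n: int, k: int) -> bool:
    """Counting conditions of Lemma 3 for a k-regular graph on n nodes.

    Hypothetical helper: it checks that (k+1) divides n and that
    n*k*(k-1)/(k+1) is even. It does not look at the graph itself.
    """
    if n % (k + 1) != 0:
        return False
    # n*k/(k+1) P-nodes, each with degree (k-1) inside the P-node partition
    degree_sum = (n // (k + 1)) * k * (k - 1)
    return degree_sum % 2 == 0


print(satisfies_necessary_conditions(24, 3))  # cube-connected cycles example
print(satisfies_necessary_conditions(12, 3))  # parameters of the counterexample
print(satisfies_necessary_conditions(10, 3))  # k+1 = 4 does not divide 10
```

Since k(k-1) is a product of two consecutive integers, the degree sum is always even once (k+1) divides n, so in this sketch the divisibility test is the condition that actually discriminates.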


LEMMA 4: The problem of finding a perfect embed- 
ding in a k-regular graph is NP-complete. 


PROOF: Given a k-regular graph on n nodes, the
problem of finding a perfect embedding is the same as that of
finding a dominating set in that graph. A dominating set
$V' \subseteq V$ is a set of nodes such that every node $v \in V - V'$ is
adjacent to at least one node belonging to $V'$. A perfect
embedding requires that every node in the graph be adja-
cent to exactly one I-node and hence the set of I-nodes
forms a dominating set. The perfect embedding problem
now can be seen as the problem of finding a dominating
set $V'$ with $|V'| = \frac{n}{(k+1)}$. A perfect embedding requires
that $V'$ be both a dominating set and an independent set
and finding such a set is known to be NP-complete [15].
Hence, the problem of finding a perfect embedding in a
k-regular graph is NP-complete. $\Box$
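Because of this hardness result, exhaustive search is practical only for small networks. The sketch below is an illustration, not the authors' method. It assumes that a perfect embedding is a set of I-nodes whose closed neighborhoods (each I-node together with its neighbors) cover every node exactly once, which is the independent-and-dominating reading used in the proof. The function name `find_perfect_embedding` and the adjacency-dictionary representation are assumptions.

```python
from itertools import combinations


def find_perfect_embedding(adj):
    """Brute-force search; adj maps each node to the set of its neighbors.

    Assumes the graph is k-regular. Returns a set of I-nodes, or None.
    The running time is exponential in the number of nodes.
    """
    nodes = sorted(adj)
    n = len(nodes)
    k = len(adj[nodes[0]])
    if n % (k + 1) != 0:          # necessary condition from Lemma 3
        return None
    for candidate in combinations(nodes, n // (k + 1)):
        served = {}
        for u in candidate:
            for v in adj[u] | {u}:
                served[v] = served.get(v, 0) + 1
        # every node must be served by exactly one I-node
        if len(served) == n and all(c == 1 for c in served.values()):
            return set(candidate)
    return None


# 3-cube: nodes 0..7, neighbors differ in exactly one address bit
cube3 = {v: {v ^ (1 << b) for b in range(3)} for v in range(8)}
print(find_perfect_embedding(cube3))   # the two I-nodes are at Hamming distance 3

# ring on 4 nodes: k + 1 = 3 does not divide 4, so no perfect embedding
ring4 = {v: {(v - 1) % 4, (v + 1) % 4} for v in range(4)}
print(find_perfect_embedding(ring4))
```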


However, some of the commonly used networks 
such as ring, star, binary tree, and completely connected 
networks have simple characterizations for a perfect 
embedding. Networks that do not have simple characteri- 
zations for a perfect embedding include meshes and 
toroids. 


DEFINITION 4: A minimal I/O embedding is defined 
as an I/O embedding from which we cannot discard an 
I-node and still obtain an I/O embedding of the graph. 


It is noted that a minimal embedding may not be
minimum or perfect. A minimal embedding requires that
the set of vertices V be partitioned into two sets T, an
independent set and the set of its neighbors N(T) such
that T+N(T) = V. An independent set is a set of nodes
with no edges between them.


LEMMA 5: A graph can always be partitioned into
two sets T, an independent set and N(T), the set of its
neighbors such that T+N(T) = V.


PROOF: If we can partition the graph in such a
manner, then we can make T the set of I-nodes and N(T)
the set of P-nodes to obtain a minimal embedding of the
network. Proof by contradiction: assume that it is not
possible to partition a graph in such a manner. Consider a
maximal independent set T. If $T+N(T) \ne V$, then there
must exist a third set of vertices S such that T+N(T)+S
= V. The vertices in S are not adjacent to any vertex in T,
otherwise they would be in N(T). Then we can choose a
vertex u from S and augment T with it to obtain a larger
independent set contradicting that T is a maximal
independent set. Hence, by contradiction, we can always
find such a partition that gives a minimal embedding of a
given network. $\Box$


The proof of Lemma 5 gives us the following algo-
rithm to find a minimal embedding in a given graph.
Choose a node from the set of nodes V. Put this in the set
T and call the set of its neighbors N(T) as shown in
Fig.6. Let S = V - T - N(T). If possible, choose a node
from S that is not in N(N(T)). If not, choose any node in
S. Put this in T and continue as above till the set S is
empty. This algorithm gives a minimal embedding, not
necessarily a minimum or perfect embedding. For example, consider a
2-regular bipartite graph on 12 vertices as shown in Fig.7.


Fig.6. A minimal embedding of a graph (showing the P-nodes and the I-nodes).

Start with node 0 in T, then we have N(T) = {6,7} and S
= {1,2,3,4,5,8,9,10,11}. Choose a node, say 2, from S that
is not in N(N(T)) (since it is possible). Now T = {0,2},
N(T) = {6,7,8,9} and S = {1,3,4,5,10,11}. Continuing like
this, we may get an embedding T = {0,2,10,5,1}, which is
minimal but not minimum or perfect. A perfect embed-
ding exists in this graph and it is given by {0,8,3,11}.
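The greedy procedure and the worked example can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. The edge set of the Fig.7 network is not recoverable from the text, so the sketch assumes a 12-cycle in which vertex i (0 to 5) is adjacent to 6+i and 6+((i+1) mod 6); this assumption is consistent with every set quoted in the example. The names `minimal_embedding`, `is_perfect` and `follow` are hypothetical.

```python
def minimal_embedding(adj, choose=min):
    """Greedy construction of a minimal embedding (a maximal independent set).

    adj maps each node to the set of its neighbors; choose picks one node
    from a non-empty candidate set.
    """
    T, covered = set(), set()            # T = I-nodes, covered = N(T)
    S = set(adj)                         # S = V - T - N(T)
    while S:
        two_hop = set().union(*(adj[v] for v in covered))   # N(N(T))
        preferred = S - two_hop          # nodes of S not in N(N(T)), if any
        u = choose(preferred or S)
        T.add(u)
        covered |= adj[u]
        S -= adj[u] | {u}
    return T


def is_perfect(adj, T, k):
    """Check T independent, T + N(T) = V and |N(T)| = k|T| (k-regular graph)."""
    covered = set().union(*(adj[u] for u in T))
    independent = not (covered & T)
    return independent and (T | covered) == set(adj) and len(covered) == k * len(T)


# Assumed edge set for the Fig.7 example (a 2-regular bipartite 12-cycle).
fig7 = {v: set() for v in range(12)}
for i in range(6):
    for j in (6 + i, 6 + (i + 1) % 6):
        fig7[i].add(j)
        fig7[j].add(i)


def follow(order):
    """Chooser that replays a fixed preference order, as in the worked example."""
    return lambda candidates: next(v for v in order if v in candidates)


T = minimal_embedding(fig7, choose=follow([0, 2, 10, 5, 1]))
print(sorted(T), is_perfect(fig7, T, 2))
print(is_perfect(fig7, {0, 8, 3, 11}, 2))
```

With the replayed choices the sketch returns the five I-nodes of the example, which fail the perfect-embedding check, while {0,8,3,11} passes it. A different chooser (the default picks the smallest candidate) generally yields a different minimal embedding.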


By carrying out all the choices, one could find all 
the minimal embeddings and choose the minimum 
embedding from them. The given graph may have a 
number of minimal embeddings and hence considerable 
effort may be needed to find the minimum embedding 
this way. A perfect embedding in a k-regular graph
requires that there exist an independent set T such that
T+N(T) = V and |N(T)| = k|T|, where |N(T)| and |T| are
the cardinalities of the two sets N(T) and T respectively.
One could use this condition to check if a minimal 
embedding is a perfect embedding. 


Fig.7. An example network for finding
a minimal embedding (vertices 0-5 on one side, 6-11 on the other).


6. CONCLUSIONS 


In this paper, we presented a new method of con- 
necting I/O processors and processors in a hypercube 
multiprocessor system. Necessary and sufficient condi- 
tions are derived to obtain a perfect embedding. It is 
shown that the proposed method achieves a higher degree
of I/O adjacency and higher degree of fault tolerance 
with the same number of I/O processors. The practical 
implications of this method are discussed. The problem is 
generalized to a k-regular interconnection network and it 
is shown that the conditions derived for a hypercube are 
necessary but not sufficient. It is shown that finding a 
perfect embedding in a k-regular interconnection network 
is NP-complete. We also presented an algorithm to find a 
minimal embedding in a general interconnection network. 
It would be interesting to see how concurrent I/O may
improve the performance of the numerous algorithms
developed for the hypercube.


A Unified Approach to Designing Fault-Tolerant Processor Ensembles
( Extended Abstract ) 


S. Chakravarty
Dept. of Computer Science 
State University of New York 
Buffalo, NY 14260 


Abstract —- Processor ensembles ( abbrev. PEN ) form part 
of parallel processing systems. We present a unified approach to de- 
signing fault-tolerant PENs. Our approach is illustrated by present- 
ing fault-tolerant schemes for several commonly used interconnection 
topologies. Our fault-tolerance scheme is shown to be “area-efficient”, 
unlike another fault-tolerance scheme viz. the Diogenes approach [7].
Unlike the reliability analysis of fault-tolerant PENs that have ap- 
peared in the literature our reliability analysis takes into account 
switch failures along with processor and link failures. 


1. Introduction 


Processor ensembles ( PEN) form part of parallel processing sys- 
tems. The parallel processing system could be a parallel machine or 
a special purpose VLSI chip. We assume that parallel processing sys- 
tems using PENs consist of a control unit ( CU ) and a PEN with 
N processing elements ( PE ). The N PEs in the PEN communicate 
with each other through a set of communication links. The intercon- 
nection topology of a PEN is represented by a graph where the nodes 
of the graph represent the PEs and the edges represent the commu- 
nication links between the PEs. There exists an edge between nodes 
I and J if and only if there exists a communication link between the 
corresponding pairs of processors. 

The PEs could be very complex in which case each PE is integrated
on a separate chip and the chips are interconnected to form a PEN. In this case,
processor or link failures lead to low system availability. On the other 
hand, the PEs could be very simple in which case the PEN can be 
integrated on a single chip. In this case processor or link failures 
lead to low yield. This motivates the need to design PENs that can 
tolerate link and/or processor failures. In this paper we address the 
problem of designing fault-tolerant PENs. It is assumed that the CU
can diagnose the faulty processors and it is capable of reconfiguring
the PEN by setting appropriate control signals.

We present a unified approach to designing fault-tolerant PENs. 
Our approach is illustrated by presenting fault-tolerant schemes for 
a number of commonly occurring interconnection topologies like the 
Binary Tree, X-Tree, Mesh, Hypercube, Pyramid etc. These PENs 
can be recursively defined and can be constructed by interconnecting 
a number of copies of a basic module ( abbrev. BM ). There could
be a number of different BMs for each PEN and a number of copies 
of any one of them could be suitably interconnected to construct the 
PEN. 

The proposed fault-tolerance scheme is based on the above prin- 
ciple. We first determine a BM from which the given PEN can be 
constructed. Based on the BM so determined we design a fault- 
tolerant basic module ( abbrev. FTBM ). The fault-tolerant 
PEN is then constructed by suitably interconnecting the FTBMs. 

The problem of designing fault-tolerant Binary Trees has been ex-
tensively studied [3,6,7,9,10]. We show that the fault-tolerance scheme
for Binary Trees resulting from our approach is the same as the fault- 
tolerance scheme for Binary Trees proposed in [3]. 


An attractive feature of the scheme proposed here is that it is 
area-efficient. Let M be a PEN having N PEs; and Q be the
corresponding fault-tolerant PEN derived from M using a specific
fault-tolerance scheme. The fault-tolerance scheme is said to be area-
efficient if there exists a layout of size O(p(N)) for Q given that there
exists a layout of size O(p(N)) for M.


S. J. Upadhyaya
Buffalo, NY 14260


Not all fault-tolerance schemes are area efficient as illustrated by 
the fault-tolerance scheme for Binary Trees proposed in [6]. It is 
known that there exists a layout of size O(N) for a Binary Tree with
N nodes [4]. But, the fault-tolerant N node Binary Tree resulting from
the scheme proposed in [6] requires area equal to O(N Log(N)) [3].
The Diogenes scheme[7] is also not an area efficient fault tolerant 
scheme. For certain N node PENs the area could increase by a factor 
of O(N) if the Diogenes approach is used. 

We also present a different approach to analyzing the reliability of 
fault-tolerant PENs. In our analysis we take into account the failure 
of the switches required for reconfiguring the system, along with link 
and processor failures. This is in contrast to the analysis in [3,5,6,8,9] 
where switches are assumed to be fault-free. We believe that our 
approach to reliability analysis of PENs is more accurate than the 
approach in [3,5,6,8,9]. 

2. Fault-Tolerant Basic Modules 


In our approach to designing fault-tolerant PENs we first define a 
basic module ( BM ) for the PEN of interest. The BM should be 
such that a number of copies of the BM could be interconnected to 
construct the PEN. Figure 2.1(a) shows a BM for Meshes having an even
number of rows and an even number of columns. The corresponding
FTBM is shown in Figure 2.1(b). 

Given a BM we construct an FTBM as follows. The number n
of corners of the FTBM equals the number of PEs in the BM. The
corners of the FTBM are named $C_1, \ldots, C_n$. Every FTBM with n
corners has (n + 1) PEs, $PE_1, \ldots, PE_n$, and SPE. SPE is a spare
PE which is used to replace a faulty PE within the FTBM. Since our
scheme uses only one spare within an FTBM, our scheme can tolerate 
only one processor failure within an FTBM. One can easily extend 
the scheme to tolerate multiple failures within an FTBM by inserting 
more than one spare processor per FTBM. 

Given a PEN the number of neighbors of a processor in the PEN 
i.e. the number k of ports per processor is known. For example, all 
processors in a Mesh are connected to 4 other processors; so the num-
ber of ports per processor in a Mesh is 4. Every corner $C_i$ consists of
k cornerpoints $P_{i,1}, \ldots, P_{i,k}$. For all $1 \le i \le n$, processor $PE_i$ has
k ports $L_{i,1}, \ldots, L_{i,k}$. The SPE has k ports named $SL_1, \ldots, SL_k$.

FTBM also contains a number of switches which are used for re-
configuring the FTBM. Each switch can be either open or closed. The
arrangement of the switches in FTBM is described below.

If $PE_i$ is fault free then it is placed in corner $C_i$. For all $1 \le i \le n$,
associated with $PE_i$ are a set of switches $S_{i,1}, \ldots, S_{i,k}$. To place $PE_i$
in corner $C_i$, port $L_{i,t}$ is connected to the cornerpoint $P_{i,t}$ by closing
switch $S_{i,t}$, for all $1 \le t \le k$. Note that in order to connect a port to a
cornerpoint we require a switch and a link that connects the port to
the cornerpoint. In our discussion we consider such links to be part
of the switch itself. If $PE_i$ is faulty then $PE_i$ is removed from corner
$C_i$ by opening all the switches associated with $PE_i$.

SPE has associated with it $n \times k$ switches. These $n \times k$ switches are
divided into n groups $G_1, \ldots, G_n$ corresponding to the n corners of
the FTBM. Each group consists of k switches and for each port of the
SPE we have one switch in each of the groups. The switches in group
$G_i$ are named $SW_{i,1}, \ldots, SW_{i,k}$. For all $1 \le t \le k$, switch $SW_{i,t}$ is
used for connecting/disconnecting port $SL_t$ to/from the cornerpoint
$P_{i,t}$. If none of the PEs are faulty then all the switches associated
with the SPE are open and the SPE is not used. If there exists an i
such that $PE_i$ is faulty then all the switches in group $G_i$ are closed
and the switches in all the other groups are opened. This places the
SPE in corner $C_i$.

Figure 2.1(b) shows the naming of the various components of the
FTBM for the corner $C_i$. Observe that for each FTBM we can use n
control signals, $E_1, \ldots, E_n$, for reconfiguring the FTBM as discussed
below. The switches that belong to group $G_i$ and the switches asso-
ciated with $PE_i$ are ganged. If $E_i$ is 1 ( respectively 0 ) then all the
switches associated with $PE_i$ are closed ( respectively open ) and all
the switches in $G_i$ are open ( respectively closed ).
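The ganged control described above can be illustrated with a small Python sketch. It is a model for exposition only, not a hardware description from the paper: the function name `switch_states`, the use of True for a closed switch, and the check that at most one control signal is 0 (one spare per FTBM) are assumptions made here.

```python
def switch_states(E, k):
    """Derive switch settings of one FTBM from its control signals.

    E[i] models control signal E_(i+1): 1 keeps the PE in its corner,
    0 lets the spare SPE take over that corner. Returns (S, SW), where
    S[i][t] is the switch joining a port of the PE to its cornerpoint and
    SW[i][t] is the matching switch of the spare's group for that corner.
    True means closed, False means open (a convention assumed here).
    """
    if sum(1 for e in E if e == 0) > 1:
        raise ValueError("one spare per FTBM: at most one signal may be 0")
    S = [[bool(e)] * k for e in E]       # PE switches follow the signal
    SW = [[not e] * k for e in E]        # spare group is the ganged complement
    return S, SW


# Mesh FTBM (n = 4 corners, k = 4 ports) with the second PE faulty
S, SW = switch_states([1, 0, 1, 1], k=4)
print(S[1], SW[1])   # PE switches open, spare group closed for that corner
print(S[0], SW[0])   # healthy corner: PE switches closed, spare group open
```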

The interconnection topology within an FTBM is specified by pro- 
viding a graph G(V,E). V is the set of cornerpoints and there exists 
an edge between two nodes in the graph if and only if the two corner 
points are connected. In the sequel, we refer to these links as intra- 
FTBM links. The interconnection topology for an FTBM can be 
derived from the BM of interest. 

Note that FTBMs also have some external links which are used 
to interconnect the copies of the FTBM while constructing the fault- 
tolerant system. We refer to these external links as inter-FTBM 
links. The inter-FTBM links for an FTBM can be derived from the
architecture of interest and the BM being used. In fact, the inter- 
FTBM links are identical to the inter-BM links used for interconnect- 
ing the BMs. 

We next derive a set of identities which are used in the reliability
analysis to follow. We assume that switches can be stuck-close, stuck-open
or normal and that switch failures are independent. Let m be the
number of intra-FTBM links; p be the probability that a processor
is fault free; $p_n$ be the probability that a switch is normal; $p_o$ be the
probability that a switch is stuck-open; $p_c$ be the probability that a
switch is stuck-close; $p_L$ be the probability that a link is fault-free;
and $P_M$ be the probability that we have a working FTBM.

Observe that we have a working FTBM if and only if one of the
following (n + 1) disjoint events occurs. (i) $PE_1, \ldots, PE_n$ are fault-
free; all the switches associated with $PE_1, \ldots, PE_n$ are closed; and all
the switches associated with SPE are open. (ii) For each $1 \le i \le n$:
$PE_i$ is faulty; all switches associated with $PE_i$ are open; all switches
associated with $PE_j$, $j \ne i$, are closed; the switches in group $G_i$ are
closed; and all switches in group $G_j$, $j \ne i$, are open. From this we
can derive the following expression for $P_M$.


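$$P_M = \{p^n \times [(p_n + p_c)^k \times (p_n + p_o)^k]^n\} \times p_L^m
      + n \times (1-p) \times p^n \times [(p_n + p_c)^k \times (p_n + p_o)^k]^n \times p_L^m$$
$$P_M = p^n \times [(p_n + p_c)^k \times (p_n + p_o)^k]^n \times [1 + n(1-p)] \times p_L^m \qquad (1)$$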


Let K be the number of FTBMs required and T the number of 
inter-FTBM links required to construct the PEN. The reliability R
of the fault-tolerant PEN is given by the following equation.


$$R = P_M^K \times p_L^T \qquad (2)$$


The values of K and T are computed for the PENs under consider- 
ation and are dependent on the BMs used. It should be clear from 
the above discussion that the values of n, k, m, K and T along with 
equations (1) and (2) are sufficient for reliability analysis of the fault- 
tolerant PEN. 
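As a numerical illustration, the two equations can be evaluated with a few lines of Python. The sketch reads equation (1) as P_M = p^n x [(p_n + p_c)^k x (p_n + p_o)^k]^n x [1 + n(1 - p)] x p_L^m and equation (2) as R = P_M^K x p_L^T; the function names and the sample probabilities below are hypothetical and are not taken from the paper.

```python
def ftbm_reliability(n, k, m, p, p_n, p_o, p_c, p_L):
    """Probability of a working FTBM, following the reading of equation (1).

    n corners, k ports per PE, m intra-FTBM links; p, p_n, p_o, p_c and p_L
    are the processor, switch (normal, stuck-open, stuck-close) and link
    probabilities defined in the text.
    """
    switch_term = ((p_n + p_c) ** k * (p_n + p_o) ** k) ** n
    return p ** n * switch_term * (1 + n * (1 - p)) * p_L ** m


def pen_reliability(P_M, K, T, p_L):
    """Reliability of a PEN built from K FTBMs and T inter-FTBM links (equation (2))."""
    return P_M ** K * p_L ** T


# Illustrative values only: a Pyramid FTBM (n = 4, k = 9, m = 4)
P_M = ftbm_reliability(n=4, k=9, m=4, p=0.99,
                       p_n=0.998, p_o=0.001, p_c=0.001, p_L=0.999)
print(P_M)
print(pen_reliability(P_M, K=6, T=28, p_L=0.999))
```

Setting p_n = 1 and p_o = p_c = 0 recovers the switch-free model, which gives a quick way to see how much the switch failure terms lower the estimate.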


3. Fault-Tolerant Pyramids 


Pyramids have found extensive use in image-processing [8]. A Pyra-
mid of size 16 is shown in Figure 3.1. Each processor is connected to
as many as 9 neighbors. A PE in level i is connected to one PE in
level (i+1) using the up link, provided level (i+1) exists; to four PEs
in level i using the north, south, east and west links; and to four PEs
in level (i-1) using the down links. Note that every level, except the
level containing the apex, has an even number of PEs which is also a
perfect square.

Our fault-tolerant Pyramid uses the FTBM derived from the BM
of Figure 3.2, as discussed in Section 2. For this FTBM we have n =
4; m = 4; and k = 9. The inter-FTBM links are identical to the links
of the BM shown in Figure 3.2. $N_1$ and $N_2$ are the two north links.
$E_2$ and $E_3$ are the two east links. $S_3$ and $S_4$ are its two south
links. $W_1$ and $W_4$ are the two west links. Each FTBM has one
up-neighbor. All corners of the FTBM are connected to exactly one
corner of its up-neighbor. Accordingly, each corner $C_i$ of the FTBM
has one up link $U_i$ which is used for connecting the corner $C_i$ of
the FTBM to its up-neighbor. Each corner $C_i$ has four down links,
$D_i^1, D_i^2, D_i^3, D_i^4$, which are used for connecting the corner $C_i$ of the
FTBM to its four down neighbors.

Figure 3.1 shows how the FTBM of Figure 3.2 can be used to 
construct the Pyramid. Note that the level containing the apex of 
the pyramid contains only one processor. In our discussion we assume
that the root is also made up of one FTBM and only one of its corners
is used. In Figure 3.1 the FTBMs are demarcated using dashes.

We next present a sketch of the derivation of the reliability ex-
pression of the fault-tolerant PEN resulting from this scheme. For
a complete derivation refer to [2]. Consider a pyramid with L + 1
levels. Let $N_i$ be the number of FTBMs at level i; and K be the
total number of FTBMs. Then, for all $0 \le i \le L-1$, $N_i = 4^{L-i-1}$.
Therefore, $K = 1 + \sum_{i=0}^{L-1} N_i = \frac{4^L + 2}{3}$.

To compute the number T of inter-FTBM links the inter-FTBM
links are grouped into $G_1, G_2$. $G_1$ consists of all the inter-level links
and $G_2$ consists of all the intra-level links. Let $T_1 = |G_1|$; and $T_2 =
|G_2|$. $T_1 = 4 \times \frac{4^L - 1}{3}$.

There are no intra-level links for the level containing the root.
For all $0 \le i \le L-1$, let $T_2^i$ be the number of intra-level links at
level i; $T_h^i$ be the number of horizontal links between the columns of
FTBMs of level i; and $T_v^i$ be the number of links between the rows
of FTBMs of level i. We have, $T_h^i = 2 \times \sqrt{N_i} \times (\sqrt{N_i} - 1)$; and
$T_2^i = T_h^i + T_v^i = 2 \times T_h^i$. Therefore, $T_2^i = 4 \times 4^{(L-i-1)/2} \times [4^{(L-i-1)/2} - 1]$.
Also, $T_2 = \sum_{i=0}^{L-1} T_2^i = 4 \times [\frac{4^L - 1}{3} - (2^L - 1)]$. Therefore, $T = T_1 + T_2 =
\frac{4}{3} \times (2^L - 1)(2^{L+1} - 1)$.

For this FTBM, n = 4, k = 9 and m = 4. The reliability expression
for the fault-tolerant PEN can now be computed using equation (2)
of Section 2.
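The counts K and T for the Pyramid can be cross-checked by summing level by level. The Python sketch below is illustrative and not from the paper. It assumes levels are numbered 0 (base) to L (apex), four PEs per FTBM on every level below the apex, one up link per PE below the apex, and two links between each pair of adjacent FTBMs on a level; the closed forms it compares against, K = (4^L + 2)/3 and T = 4(2^L - 1)(2^(L+1) - 1)/3, follow from those assumptions. Both function names are hypothetical.

```python
def pyramid_counts(L):
    """Count FTBMs (K) and inter-FTBM links (T) level by level."""
    K, T1, T2 = 1, 0, 0                      # the apex level holds one FTBM
    for i in range(L):                       # levels 0 .. L-1 below the apex
        pes = 4 ** (L - i)                   # PEs on level i
        side = 2 ** (L - i - 1)              # FTBMs per row, sqrt(N_i)
        K += side * side
        T1 += pes                            # one up link per PE
        T2 += 2 * (2 * side * (side - 1))    # horizontal plus vertical links
    return K, T1 + T2


def pyramid_counts_closed_form(L):
    """Closed forms for K and T under the same assumptions."""
    K = (4 ** L + 2) // 3
    T = 4 * (2 ** L - 1) * (2 ** (L + 1) - 1) // 3
    return K, T


for L in range(1, 6):
    print(L, pyramid_counts(L), pyramid_counts_closed_form(L))
```

For the Pyramid of size 16 (L = 2) both functions give 6 FTBMs and 28 inter-FTBM links.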




4. Other Common Topologies


We discuss briefly the fault-tolerant schemes for Binary Trees, Meshes 
and Hypercubes derived using our approach. The scheme for the X-Tree
is very similar to that for the Binary Tree and is discussed in [2].

The Binary Tree structure is well known. A BM for the Binary 
Tree is shown in Figure 4.1(a). The corresponding FTBM for the 
BM of Figure 4.1(a) can be derived as discussed in Section 2. Note 
that the FTBM so derived is the module used in [3] for designing the 
reconfigurable Binary Tree. Figure 4.1(b) shows how the FTBM can 
be interconnected to form the Binary Tree. 

For this FTBM we have n = 3; m = 2; and k = 3. Let K be the
number of FTBMs; 2M be the number of levels in the tree; and T be
the number of inter-FTBM links. Then, as shown in [2], $K = \frac{4^M - 1}{3}$
and $T = \frac{4^M - 4}{3}$. From the values of n, m and k the expression for $P_M$
can be derived using equation (1) of Section 2. From the values of
$P_M$, T and K computed above, we can derive the reliability expression
for the Binary Tree using equation (2) of Section 2.

Meshes, as an interconnection topology, have been used in the 
ILLIAC IV and studied in [1]. Figure 4.2(a) depicts a mesh whose 
sides are of size n and we assume that n = 2M. Figure 2.1(a) depicts 
a BM for the mesh. For the corresponding FTBM we have n = 4; k =
4; and m = 4. Figure 4.2(b) shows how the BM of Figure 2.1(a) can
be interconnected to form a mesh whose sides are an even number.
We have T = M?; and K = 4(/~1) [2]. From this and the model for
FTBM in Section 2 we can derive the reliability expressions for the 
mesh. 


The hypercube of size N, where $N = 2^q$, is defined as follows.
Each PE is assumed to have a q (= log(N)) bit address. Two PEs
are connected if and only if their addresses differ in exactly one bit
position. A BM for the hypercube is shown in Figure 4.3(a) and a
method for constructing the hypercube from the BM of Figure 4.3(a)
is shown in Figure 4.3(b). We can view the BMs to be made up of the
four processors that have the q - 2 most significant bits identical.
Accordingly, we can assign a q - 2 bit address to each of the FTBMs
used. Then, two FTBMs have an inter-FTBM link between them if
and only if their q - 2 bit addresses differ in at most one bit position.
Therefore, $K = 2^{q-2}$; and $T = (q - 2) \times 2^{q-1}$ [2].
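The values of K and T for the Hypercube can be confirmed by enumerating the links directly. The following Python sketch is an illustration, not part of the paper. It assumes that the FTBM of a PE is identified by the q - 2 most significant bits of its address (the address shifted right by two) and counts every hypercube link whose endpoints lie in different FTBMs; the function name `hypercube_ftbm_counts` is hypothetical.

```python
def hypercube_ftbm_counts(q):
    """Count FTBMs (K) and inter-FTBM links (T) of a q-dimensional hypercube."""
    N = 1 << q

    def ftbm(addr):
        return addr >> 2                 # q-2 most significant address bits

    K = len({ftbm(a) for a in range(N)})
    T = 0
    for a in range(N):
        for b in range(q):
            neighbor = a ^ (1 << b)      # address differing in bit b only
            # count each link once, and only if it joins two different FTBMs
            if a < neighbor and ftbm(a) != ftbm(neighbor):
                T += 1
    return K, T


for q in range(3, 7):
    print(q, hypercube_ftbm_counts(q), (2 ** (q - 2), (q - 2) * 2 ** (q - 1)))
```

The enumeration agrees with K = 2^(q-2) and T = (q-2) x 2^(q-1) for the small cases printed; links that flip one of the two least significant bits stay inside an FTBM and are the m = 4 intra-FTBM links.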


5. Comparison with the Diogenes Approach 


The Diogenes approach [7] has been shown to be applicable to a variety
of PENs. Here we compare our approach with the Diogenes approach. 

The first figure of merit we use is the area required by the 
fault-tolerant PENs designed using the two approaches. We had noted 
earlier that the fault-tolerant Binary Tree resulting from our approach 
is the same as the fault-tolerant Binary Tree of [3]. In [3] it was shown
that the fault-tolerant Binary Tree with N nodes has a layout of size
O(N). We next present a layout strategy for our fault-tolerant PENs
to show that our scheme is area efficient.

Define the corner layout graph ( CLG ) of the fault-tolerant 
PEN as follows. For each FTBM used in constructing the PEN, CLG 
has n nodes, where n is the number of corners of the FTBM. There 
exists an edge between two nodes in the CLG if and only if there 
exists either an inter-FTBM link or an intra-FTBM link between the
cornerpoints of the two corresponding corners. 

Define the FTBM layout graph ( FLG ) of the fault-tolerant 
PEN as follows. For each FTBM used in constructing the PEN, FLG 
has one node. There exists an edge between two nodes in the FLG if 
and only if there exists an inter-FTBM link between the correspond-
ing FTBMs. The following observation follows from the definition
of FLG. The FLGs of the fault-tolerant Binary Tree, X-Tree, Mesh,
Pyramid and Hypercube presented in this discussion are respectively
Binary Tree, X-Tree, Mesh, Pyramid and Hypercube.

The CLG can be derived from the FLG by replacing each node in 
FLG by the corner graph of the FTBM which is defined as follows. 
For each corner of the FTBM the CG contains a node. There exists an 
edge between two nodes in the CG if and only if there exists an intra- 
FTBM link between the cornerpoints of the corresponding corners. 
The CGs of the different FTBMs used in this discussion are shown in 
Figure 5.1. We next describe the basic steps of our layout scheme. 

1. Construct the FLG H of the fault-tolerant PEN. 

2. Lay out H using the layout algorithm for the PEN.

3. Expand each node of H as follows. Let $H_i$ be the corner graph
of the FTBM in question. $H_i$ is said to be hamiltonian if and only if
there exists an acyclic path p in $H_i$ such that p traverses each of the
nodes of $H_i$. An inspection of Figure 5.1 shows that the CGs of each
of the FTBMs used in this discussion are hamiltonian. In Figure 5.1
the hamiltonian paths are shown using dashed lines.

The node in $H_i$ corresponding to the corner $C_j$ is henceforth re-
ferred to as node $C_j$. Without loss of generality let $C_1, \ldots, C_n$ be the
hamiltonian path. Then, $C_1$ contains the spare processor SPE; for all
j, $C_j$ contains the switches $S_{j,1}, \ldots, S_{j,k}$ and the processor $PE_j$; for
all j, $C_j$ contains the switches in group $G_j$; and for all j, $C_j$ contains
the cornerpoints of the corner $C_j$.

The edges between the nodes of the CG are expanded to include
the intra-FTBM links between the cornerpoints. The connections be-
tween the ports of the spare and the switches in the different groups
are run parallel to the hamiltonian path. For each switch there is a
link to it from the appropriate port of the SPE. Only $n \times k$ such links
between two adjacent nodes in the hamiltonian path are needed.

Asymptotically, the area required by the resultant layout will be 
the same as the area required by the FLG constructed in the first step 
provided the number of neighbors of a PE in a PEN is a constant. This
is because each of the nodes will be replaced by a constant number of 
switches and processors and each edge will be replaced by a constant 
number of edges. From this it follows that the fault-tolerant Binary- 
Tree, X-Tree, Mesh, Pyramids are all area efficient. 

The above layout strategy does not give an area efficient fault- 
tolerant Hypercube because the number of neighbors of processors in 
the Hypercube is not a constant. But, for the Hypercube we use a
different layout strategy which gives an area efficient layout and is discussed
in [2].

Unlike our fault-tolerant scheme, the Diogenes scheme is not area-
efficient. This is because, as stated in [7], the Diogenes approach uses
collinear layouts [4]. Collinear layouts suffer from the drawback that
they do not lead to optimal area. For example, there exists a layout
viz. the H-Layout [4] for the Binary Tree which requires area O(N) but
the area required by the collinear layout of a tree is O(N Log(N)) [4].
Similarly, the collinear layout for the Mesh requires area $O(N\sqrt{N})$ [4]
whereas there exist layouts for the Mesh that require area O(N) [4].
Thus we see that the fault-tolerant PENs derived using the Diogenes
approach have an area overhead that is larger than the area overhead
of the fault-tolerant PENs derived using our approach. The ratio of
the areas required by the Diogenes approach and our approach could
be as high as $O(\sqrt{N})$.

The second figure of merit we use is the fault tolerance of the 
fault-tolerant PENs. It was stated in Section 2 that our scheme can 
tolerate only one processor failure per FTBM. The Diogenes approach 
is a globally fault-tolerant scheme in that it has the characteristic that 
if S spares are used it can recover from any S processor failures. This 
is often referred to as 100% spare utilization. Therefore, the Diogenes 
approach has greater fault-tolerance than our approach. 


6. Discussion 


We presented an area efficient, locally redundant fault-tolerance scheme
for PENs and showed it to be applicable to a variety of PENs. The
Diogenes approach, which is a global redundancy scheme having 100%
spare utilization, is not area efficient. These two schemes represent the
two extremes of the spectrum of fault-tolerance schemes for PENs.
For the Binary Tree a number of other fault-tolerant schemes [5,6,10]
have also been proposed. These schemes are either a combination of
local and global schemes [5,6] or local schemes using a varying number
of spares [10].

A new approach to analyzing the reliability of PENs was pre- 
sented. Our reliability analysis takes into account failure of switches 
which are used for reconfiguring the system. This makes our reli-
ability analysis more accurate than the reliability analyses of fault-
tolerant PENs that have appeared in the literature, which do not
take into account switch failures.


Abstract--The reliable execution of
critical tasks on computing systems where faulty
responses can jeopardize human life or can cause
a vast loss of money is an important design issue.

This work studies the reliable execution of 
the tasks in an environment where processors and 
interprocessor communication channels are subject 
to failure. Every task is assigned to a group 
of the processors for execution. The processors 
of a group compare their outputs to obtain the 
group output. 

Performance is improved if the number of 
concurrent tasks (i.e., the number of groups) is
maximized. In order to achieve that, a system 
is modeled with the use of a graph where a node 
and a link of the graph represent a processor and 
a communication channel between two processors 
in the system, respectively. A new concept 
called group maximum matching, which is an ex- 
tension to the classical maximum matching, is 
introduced. In a group maximum matching the 
nodes of a graph are grouped such that no two 
groups share the same node, the nodes of a group 
comprise a connected subgraph of the graph, and 
the number of groups is maximum. A heuristic 
algorithm is proposed to obtain a group maximum 
matching. 

A fault-tolerant scheduling algorithm based 
on the group maximum matching is developed which 
ensures the error-free execution of the tasks. 
Furthermore, the proposed algorithm has the capa- 
bility for the on-line fault-diagnosis of the 
faulty processors and interprocessor communica- 
tion channels. Fault-diagnosis is achieved as 
system runs its normal tasks; hence, a diagnosis 
program is not needed for that purpose. 

I. Introduction 

Reliable execution of critical tasks on
computing systems where faulty responses can
jeopardize human life or can cause a vast loss
of money is an important design issue. The most
stringent requirement for reliability is in
real time control systems where repair is not
possible and recovery time from faults must be
short. The design of fast, available, reliable, 
and dependable computers has become possible 
with the use of multiple processor systems. 

Non-fault-tolerant scheduling of tasks
is considered in [1]-[5], where it is assumed
that processors and interprocessor communication 
channels are fault-free. 

System level fault diagnosis was introduced
in [6], where processors test each other for the
detection and location of the faults, and gener-
alized in [7]. The above techniques, in general,
require first diagnosis of the faults, then re- 
covery from the faults. However, testing a pro- 
cessor for the detection of all kinds of faults 
is debatable and is not an easy job; and also 
subsequent error recovery takes time and is gen- 
erally slow. Therefore, this has motivated the 
introduction of another model called comparison 
model where at least two processors are assigned 
to execute the same task and their outputs are 
compared. 

The comparison model is employed in [8]-[16].
However, the works in [8]-[14] are geared only
toward fault-diagnosis of the processors. As a
result, the assignment of the jobs to the
processors is not based on any efficient job
allocation scheme which maximizes the system
performance (i.e., the system throughput) while
executing the jobs reliably. For instance, in [8]
a minimum covering algorithm for single
fault-diagnosis is proposed in order to check the
status of every processor. Furthermore, in
[8]-[10] it is assumed that a faulty processor
produces an incorrect output. This does not hold
true all the time unless the programs running on
the processors are diagnostic programs rather
than normal system or user programs. The works in
[11]-[13] do not consider assignment of jobs to
the processors based on any particular allocation
scheme. The work in [14] assigns jobs to the
processors in rounds based on a permutation
scheme in which processors with short jobs finish
early and remain idle until the end of the
current round of execution.

In [8]-[14] only processor failure is
considered; we consider both processor and
interprocessor communication channel failures.

We consider a homogeneous system in which
every task can be assigned to any processor.
Examples of such systems are the Intel iPSC
system [17] and several other systems [18]-[20].
We assume that tasks are independent: they do not
communicate with each other, and are either
independent subtasks of a job or independent
jobs. We further assume a non-preemptive
scheduling scheme where, as soon as a task is
scheduled to run on a processor, it runs to
completion. We also assume that a faulty
processor does not exhibit Byzantine (malicious)
behavior and that its faulty behavior is solely
caused by faults affecting its hardware
circuitry. Finally, we assume that when a task
runs on a faulty processor, faults in that
processor may or may not affect the output of the
task, depending on the task and on the nature and
place of the faults in the processor. Thus, we
consider the tasks running on the processors to
be normal user programs, not diagnostic programs
which are carefully written to force a faulty
processor to generate incorrect output.

In this work, we are only concerned with 
fault-tolerant scheduling of the tasks and con- 
current diagnosis of the faults. But, the pro- 
posed algorithm may be combined with other 
scheduling policies [21] such as first-come, 
first-served, shortest job first, round robin, 
etc., under any type of constraint such as pri- 
ority and deadline of the tasks. Hence, in our 
work we do not specify how the next job from the 
job ready queue is selected for the execution. 

In Section II, we propose an algorithm for
the effective pairing or grouping of the
processors. Then, in Section III, we propose our
fault-tolerant scheduling algorithm, which is
based on the first algorithm.

A. System Model 

A system is modeled by a graph G(V,E), where 
V and E are the node set and the edge set of the
graph G, respectively. A node represents a pro- 
cessor with its local memory, while an edge be- 
tween two nodes represents a communication chan- 
nel between their corresponding processors in the 
system. 


B. Fault-Model 


A comparison fault model is defined as
follows: When two processors P_i and P_j are
assigned to execute the same task T_k, they
proceed as follows. First, each executes T_k
independently; then they exchange their outputs;
next, P_i (P_j) obtains its test outcome a_ij
(a_ji) with respect to T_k as follows:

1. If P_i (P_j) agrees with P_j (P_i) then
   a_ij = 0 (a_ji = 0)

2. Else a_ij = 1 (a_ji = 1)

Note that:

a) a_ij is not necessarily the same as a_ji.

b) Either a faulty processor or a faulty
   communication channel could be the source
   of the disagreement.

c) P_i and P_j may produce the same output and
   agree with each other even if one or both
   of them are faulty, depending on whether
   the faults in the processors and in the
   communication channel between them affect
   their outputs. For instance, a fault could
   be in a register of one of the processors,
   but that fault cannot affect the output of
   T_k if that register is not used in
   executing T_k.

II. Matching Concept 

In classical graph theory the matching
concept is defined as follows:

Definition 1 [22]: Given a graph G(V,E), a
matching is a subset of edges F ⊆ E such that no
two edges of F are adjacent.

A matching which covers every node is a
perfect matching.

A matching is maximum if there is no other 
matching which has more edges. A perfect match- 
ing is always a maximum matching, while the con- 
verse is not necessarily true. 
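
As a minimal sketch of these definitions, the following checks whether an edge list is a matching and whether it is perfect. The function names are illustrative, and the check that every edge belongs to E is assumed to be done by the caller.

```python
def is_matching(edges):
    """True if no two edges share an endpoint (no two edges are adjacent)."""
    seen = set()
    for u, v in edges:
        if u == v or u in seen or v in seen:
            return False
        seen.update((u, v))
    return True


def is_perfect_matching(nodes, edges):
    """True if `edges` is a matching that covers every node in `nodes`."""
    covered = {x for edge in edges for x in edge}
    return is_matching(edges) and covered == set(nodes)
```

For an odd number of nodes no perfect matching can exist, while a maximum matching always does, which is why the converse in the text fails.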

The concept of the classical matching
problem is used to pair or to group the nodes of
a graph into 2-node disjoint groups. A
generalization of that concept is to group the
nodes into (t+1)-node disjoint groups. We refer
to the generalized matching problem as the group
matching problem. In the classical maximum
matching problem, nodes are paired such that the
number of pairs is maximum. Similarly, we define
group maximum matching as a problem where nodes
are grouped such that no two groups share the
same node, the nodes of every group comprise a
connected subgraph of the graph, and the number
of groups is maximum.

A. Group Maximum Matching Algorithm 

In the following we develop an algorithm to
find a group maximum matching for a graph. This
algorithm, which is a greedy heuristic, attempts
to avoid isolating nodes so that they can be
included in a group. This is achieved by first
including the nodes with the lower degrees in the
groups, then the ones with the higher degrees. In
this algorithm every group is intended to have
(t + 1) nodes.

1.  Initialize i = 1, g_1 = ∅, g_2 = ∅, and so
    on

2.  While system graph is non-empty do

2.1  Find the node with the lowest degree.
     In case of a tie, choose one of them
     randomly. Call the node with the minimum
     degree the current node.

2.2  Add the current node to g_i

2.3  Sort the neighbors of the current node
     with respect to their degrees in an
     increasing order.

2.4  If the current node(s) have some
     neighbors then

2.4.1  If the current node(s) have t - |g_i| + 1
       or more neighbors then

2.4.1.1  Choose the first t - |g_i| + 1 sorted
         neighbors and add them to g_i

2.4.1.2  Delete the nodes in g_i from the system
         graph

2.4.1.3  i = i + 1

2.4.1.4  Go to 2

2.4.2  Else

2.4.2.1  Add all the neighbors of the current
         node(s) to g_i

2.4.2.2  Sort the neighbors of the nodes in g_i
         with respect to their degrees in an
         increasing order. Call the nodes in g_i
         the current nodes.

2.4.2.3  Go to 2.4

2.5  Else

2.5.1  Delete the nodes in g_i from the system
       graph

2.5.2  i = i + 1

2.5.3  Go to 2

2.6  End while do
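
The steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the graph is assumed to be a dictionary mapping each node to the set of its neighbors, ties are broken by node label instead of randomly so that the result is reproducible, and the name `group_matching` is hypothetical.

```python
def group_matching(adj, t):
    """Greedy grouping of nodes into connected groups of up to t + 1 nodes.

    adj : dict mapping node -> set of neighbouring nodes (undirected graph).
    Returns the list of groups in the order they were formed; groups that
    could not reach t + 1 nodes are returned as well, shorter than t + 1.
    """
    g = {u: set(vs) for u, vs in adj.items()}  # working copy of the graph

    def degree(u):
        return len(g[u])

    groups = []
    while g:                                          # step 2
        current = min(g, key=lambda u: (degree(u), u))  # step 2.1
        group = [current]                             # step 2.2
        while len(group) < t + 1:
            # Neighbours of the nodes already in the group, lowest degree
            # first (steps 2.3 and 2.4.2.2).
            candidates = sorted(
                {v for u in group for v in g[u]} - set(group),
                key=lambda v: (degree(v), v),
            )
            if not candidates:                        # step 2.5
                break
            needed = t + 1 - len(group)
            group.extend(candidates[:needed])         # steps 2.4.1.1 / 2.4.2.1
        for u in group:                               # delete the group
            for v in g.pop(u):
                if v in g:
                    g[v].discard(u)
        groups.append(group)
    return groups
```

On a 6-node ring with t = 2, `group_matching({i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}, 2)` forms two groups of three nodes each.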


Example 1: Figure 1.a shows an example with
t = 2. Initially, every node is of degree 3.
Nodes are sorted with respect to their degrees.
Let the node P_1 be listed on the top of the
list. Then, P_1 is elected as the current node
and added to g_1. Let the neighbors of P_1 be
sorted as {P_2, P_12, P_8}. Then the first two
neighbors are added to g_1 and g_1 = {P_1, P_2,
P_12}. Next the nodes in g_1 are deleted from the
system graph and the graph shown in Fig. 1.b is
obtained. Then, the nodes in this graph are
sorted and the node P_3 with the lowest degree of
1 will be on the top of the list. P_3 is elected
as the current node and added to g_2. Then, its
only neighbor P_4 is added to g_2. The sorted
neighbors of the nodes in g_2 are listed as
{P_11, P_5}. Then, P_11 is added to g_2 and g_2 =
{P_3, P_4, P_11}. Nodes in g_2 are deleted from
the graph and the graph shown in Fig. 1.c is
obtained. Let, after sorting the nodes in this
graph, P_5 with the degree of 2 be on the top of
the list. Then, P_5 is elected as the current
node and added to g_3. Next, the sorted neighbors
of P_5 are listed as {P_10, P_6}; both are added
to g_3, and g_3 = {P_5, P_6, P_10}. Nodes in g_3
are deleted from the graph and the graph shown in
Fig. 1.d is obtained. Subsequently, g_4 = {P_7,
P_8, P_9} and the algorithm terminates.

A group matching which groups the nodes
randomly generally yields fewer groups than the
proposed algorithm. For instance, Fig. 2 shows a
matching with only three groups, g_1, g_2, and
g_3, for the same graph of Fig. 1, where g_1 =
{P_1, P_2, P_12}, g_2 = {P_4, P_10, P_11}, and
g_3 = {P_6, P_8, P_9}. In this non-maximum group
matching, nodes P_3, P_5, and P_7 are isolated
because the nodes are grouped randomly, without
using any state knowledge to group them
effectively. For instance, after the selection of
the nodes of group g_1, the nodes of the group
g_2 are selected randomly without regard to the
status of the node P_3, which can be grouped only
with the node P_4. A better strategy is first to
try to include P_3, which has fewer neighbors to
group with, in a group, and then to try to
include other nodes which have more neighbors to
group with. This strategy is used in the proposed
heuristic algorithm.


B. Time Complexity of the Algorithm 


Assuming that sorting a list by heap sort
takes n log n time, the total time is obtained as
follows. The while-do loop and step 2.4 are
repeated at most n and t times, respectively.
Thus, the total time is in the order of
n·(t·(n log n)), or n^2·t·log n.


III. Fault-Tolerant Scheduling Algorithm 


This algorithm is devised to execute tasks
reliably and also to achieve on-line
fault-diagnosis in an environment where
processors and communication channels are subject
to failure. For the reliable execution of the
tasks, each task is assigned to a group of
processors. Processors are grouped with the use
of the group maximum matching algorithm. A task
is released if at least (t + 1) processors
produce the same output, under the assumption
that no more than t faulty processors or
communication channels exist.

In order to achieve on-line fault-diagnosis,
the notion of a disagreement graph G(D,M) with
the node set D and the edge set M is introduced
as follows.


A. Disagreement Graph


A disagreement graph G(D,M) with respect to
a task T is obtained for a group of the nodes
which are assigned to execute the task T as
follows: Every node D_i of the disagreement graph
contains those nodes of the group such that every
pair of nodes P_j ∈ D_i and P_k ∈ D_i agree with
each other on the output for the task T, i.e.,
a_jk = a_kj = 0, if they are neighbors in the
system graph. An edge exists between two nodes
D_i and D_j of the disagreement graph if there
exist a node P_x ∈ D_i and a node P_y ∈ D_j such
that P_x and P_y are neighbors in the system
graph and they disagree with each other on the
output for the task T, i.e., either a_xy = 1 or
a_yx = 1.
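
One way to build such a graph is sketched below. It is an assumption-laden illustration, not the paper's procedure: processors are merged with a union-find structure whenever two neighbors agree in both directions, which may place two disagreeing neighbors in the same set if they are linked through a third processor. The name `disagreement_graph` and the dictionary layout of the test outcomes are hypothetical.

```python
def disagreement_graph(group, adj, a):
    """Build (D, M) for the processors in `group` that ran the same task.

    group : iterable of comparable processor labels (e.g. integers).
    adj   : dict node -> set of neighbours in the system graph.
    a     : dict (i, j) -> 0 or 1, the test outcome a_ij.
    Returns D (list of frozensets) and M (set of index pairs into D).
    """
    parent = {p: p for p in group}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Neighbouring pairs inside the group, each listed once.
    pairs = [(p, q) for p in parent for q in adj[p] if q in parent and p < q]

    def agree(p, q):
        return a[(p, q)] == 0 and a[(q, p)] == 0

    for p, q in pairs:
        if agree(p, q):
            parent[find(p)] = find(q)

    roots = sorted({find(p) for p in parent})
    index = {r: k for k, r in enumerate(roots)}
    D = [frozenset(p for p in parent if find(p) == r) for r in roots]
    M = {
        tuple(sorted((index[find(p)], index[find(q)])))
        for p, q in pairs
        if not agree(p, q) and find(p) != find(q)
    }
    return D, M
```

For three processors on a line where 1 and 2 agree but 2 and 3 disagree, the sketch returns two nodes, {1, 2} and {3}, joined by one edge.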
In the following, the formal description of
the fault-tolerant task scheduling algorithm is
given.


B. Fault-Tolerant Scheduling Algorithm 


1.  Run the maximum matching algorithm for
    t = 1 or the group maximum matching
    algorithm for t > 1 to group the nodes
    of the system into groups g_1, g_2, ...,
    and so on.

2.  Assign every task T_j to a group g_h
    for the execution by all the nodes in
    that group, where |g_h| > t and all the
    nodes in g_h are free.

3.  Upon the completion of every task T_j
    by all the nodes assigned T_j do

3.1  Ask the nodes which are assigned T_j
     to exchange and compare their outputs
     if they are neighbors in the system
     graph, and then to obtain their test
     outcomes with respect to the task T_j.

3.2  Obtain the disagreement graph G(D,M)
     for the task T_j. Let D_1, D_2, ..., and
     so on be the nodes of the disagreement
     graph with respect to T_j

3.3  For every D_i with 0 < |D_i| ≤ t do

3.3.1  For every node P_k ∈ D_i do

3.3.1.1  If the number of nodes P_k disagreed
         with in this round of execution of
         T_j was more than t, then consider
         P_k faulty, add it to the faulty set
         S_f, and set D_i = D_i - {P_k}

3.3.1.2  End for do

3.3.2  End for do

3.4  Find the first D_i with 0 < |D_i| ≤ t
     such that the nodes in D_i have some
     neighbors P_k which are neither in
     the faulty set S_f nor in D_i

3.4.1  If there exists at least one such D_i then

3.4.1.1  If among the above neighbors in 3.4
         there are some nodes which are idle
         (i.e., not included in any group g_x
         with |g_x| > t) then select them

3.4.1.2  Else

3.4.1.2.1  Select a node P_k among the above
           neighbors in 3.4 which is running a
           task T_s with the lowest priority; in
           case of a tie, choose one of the
           neighbors randomly

3.4.1.2.2  Abort the task T_s assigned to P_k
           and to all of the nodes in its group.
           T_s must be rescheduled for execution
           by some other nodes as is done in
           step 2.

3.4.1.2.3  Select the node P_k and all of the
           nodes in the group of P_k

3.4.1.3  Assign the task T_j to all of the
         selected nodes in 3.4.1.1 or 3.4.1.2.3
         and also again to all of the nodes in
         the current disagreement graph with
         respect to T_j.

3.4.1.4  Go to 3.

3.5  If there are any |D_i| > t during the
     above process then release the output
     of T_j executed by one node in D_i.

3.6  Else abort T_j because at least t + 1
     nodes cannot be found to agree with each
     other over T_j.

3.7  The link between two fault-free
     processors is considered faulty if they
     disagree with each other.

3.8  If no node or link has failed or there
     are still some disagreements with
     respect to some other tasks not yet
     resolved then go to 2

3.9  Else

3.9.1  Finish the tasks already in progress
       with the use of the old group maximum
       matching and concurrently go to 1 to
       get a new group maximum matching after
       the deletion of the faulty nodes and
       links from the system graph.

3.9.2  Assign the new tasks based on the new
       group maximum matching.

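
The diagnosis and release decisions (steps 3.3 and 3.5) can be sketched as below. This is a simplified illustration, not the full scheduler: it reads the size condition of step 3.3 as 0 < |D_i| <= t, which is an assumption about the garbled source, it omits the expansion logic of step 3.4, and the name `diagnose_round` and its arguments are hypothetical.

```python
def diagnose_round(D, disagreements, t):
    """Apply the faulty-node test and the release test for one round.

    D             : list of sets, the nodes D_1, D_2, ... of the
                    disagreement graph for one task.
    disagreements : dict processor -> number of processors it disagreed
                    with in this round.
    t             : assumed upper bound on the number of faults.
    Returns (faulty, release): the processors declared faulty, and whether
    some D_i is large enough for the task output to be released.
    """
    faulty = set()
    remaining = []
    for component in D:
        component = set(component)
        if 0 < len(component) <= t:                                   # step 3.3
            bad = {p for p in component if disagreements.get(p, 0) > t}
            faulty |= bad                                             # step 3.3.1.1
            component -= bad
        remaining.append(component)
    release = any(len(component) > t for component in remaining)     # step 3.5
    return faulty, release
```

For example, with t = 1 and D = [{1}, {2}, {3, 4}], a processor 2 that disagreed with two others is declared faulty, and the output can be released because {3, 4} has more than t members.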

Example 2: Consider the system graph given
in Fig. 3.a. Assume that t = 1 and a maximum
matching, obtained by running step 1 of the
algorithm, pairs the nodes into the groups g_1 =
{P_1, P_2}, g_2 = {P_3, P_4}, g_3 = {P_5, P_6},
g_4 = {P_7, P_8}, and g_5 = {P_9, P_10}. Assume
that task T_i is assigned to the nodes in g_i for
i = 1, 2, 3, 4, and 5. Furthermore, assume that
the node P_2 and the link P_5 - P_6 are faulty.
Upon the completion of T_1, the disagreement
graph with respect to T_1 shown in Fig. 3.b is
obtained. Assuming that the priority of T_5 is
greater than that of T_2, the task T_2 is aborted
and the nodes in g_1 ∪ g_2 are assigned to
perform T_1, see Fig. 3.c. Then, assume that the
nodes in g_3 finish executing T_3. Figure 3.e
shows the disagreement graph with respect to T_3.
Assuming that the priority of T_1 is greater than
that of T_4, the task T_4 is aborted and the task
T_3 is assigned to the nodes in g_3 ∪ g_4, see
Fig. 3.f. Next, assume that T_5 finishes. Figure
3.g shows the disagreement graph with respect to
T_5. Since |D_1| > t = 1, T_5 is released and a
new task T_6 is assigned to the nodes in g_5, see
Fig. 3.h. Upon the completion of T_1, the
disagreement graph with respect to T_1 is
obtained, see Fig. 3.i. Since |D_i| = 1 ≤ t, the
lower priority task T_6 is aborted and the nodes
in g_1 ∪ g_2 ∪ g_5 are assigned to execute T_1,
see Fig. 3.j. Then, assume that T_1 finishes and
the disagreement graph with respect to T_1 shown
in Fig. 3.k is obtained. Since the number of
nodes P_2 has disagreed with is greater than
t = 1, P_2 is concluded to be faulty. Then, task
T_1 is released because there is at least one D_i
(i.e., either D_1 or D_3) with |D_i| > t = 1.
Thereafter, since there still exists disagreement
with respect to the task T_3, tasks T_2 and T_6
are assigned to the freed nodes in g_2 and g_5,
respectively; see Fig. 3.l. Next, assume that T_3
completes and the disagreement graph shown in
Fig. 3.m is obtained. Since |D_2| > t = 1 and
|D_1| ≤ t = 1, the lower priority task T_2 is
aborted, and task T_3 is assigned to the nodes in
g_2 ∪ g_3 ∪ g_4; see Fig. 3.n. Upon the
completion of T_3, the disagreement graph in Fig.
3.p is obtained. Since |D_1| = |D_2| > t = 1, all
the nodes in g_2 ∪ g_3 ∪ g_4 are fault-free, but
the link P_5 - P_6 is faulty. Then, task T_3 is
released. At this time no disagreement exists.
Hence, a new maximum matching pairs the nodes
into the new pairs g_1 = {P_1, P_10}, g_2 = {P_8,
P_9}, g_3 = {P_6, P_7}, and g_4 = {P_4, P_5}.
Task T_6 continues to completion based on the old
maximum matching by the nodes in the old group
g_5 = {P_9, P_10}. Tasks T_2 and T_4 are assigned
based on the new maximum matching to the nodes in
the new group g_3 and the new g_4, respectively;
see Fig. 3.r. Next, assume that T_6 completes and
the disagreement graph shown in Fig. 3.s is
obtained. Since |D_1| > t = 1, T_6 is released
and the use of the old maximum matching
terminates, and the new tasks T_7 and T_8 are
assigned, based on the new maximum matching, to
the new groups g_1 and g_2 as shown in Fig. 3.t.


Example 3: Consider the system shown in Fig.
4.a and assume that t = 2. Running step 1 of the
algorithm partitions the nodes into the groups:
g_1 = {P_1, P_2, P_12}, g_2 = {P_3, P_4, P_11},
g_3 = {P_5, P_6, P_10}, and g_4 = {P_7, P_8,
P_9}. Assume that tasks T_1, T_2, T_3, and T_4
are assigned to g_1, g_2, g_3, and g_4,
respectively. Furthermore, assume that the node
P_2 and the link P_1 - P_10 are faulty. Upon the
completion of T_1, the disagreement graph shown
in Fig. 4.b is obtained. Assume that task T_2,
because of low priority, is aborted and the nodes
in g_1 ∪ g_2 are assigned T_1, see Fig. 4.c. Upon
the completion of T_1, the disagreement graph
shown in Fig. 4.d is obtained. Then, assume that
task T_4, because of low priority, is aborted and
all the nodes in g_1 ∪ g_2 ∪ g_4 are assigned
T_1, see Fig. 4.e. Then the disagreement graph
shown in Fig. 4.f is obtained. Since the number
of nodes P_2 has disagreed with is greater than
t = 2, P_2 is concluded to be faulty. Also, since
|D_1| = |D_3| = 4 > t = 2, P_1 and P_10 are
concluded to be fault-free, but the link between
P_1 and P_10 is faulty. Then, task T_1 is
released. Next a new group maximum matching is
obtained which partitions the nodes into the new
groups g_1 = {P_1, P_7, P_8}, g_2 = {P_3, P_4,
P_12}, and g_3 = {P_9, P_10, P_11}, see Fig. 4.g.
Task T_3 continues to completion by the nodes in
the old group g_3 based on the old group maximum
matching. Tasks T_2 and T_4 are assigned to the
new groups g_2 and g_1, respectively, based on
the new group maximum matching, see Fig. 4.g.
Upon the completion of T_3, the use of the old
group maximum matching terminates, and task T_5
is assigned to the nodes of the new group g_3,
see Fig. 4.i, based on the new group maximum
matching.

Theorem 1: Every task T_j is executed
error-free if the proposed algorithm is employed
and the number of faulty processors and faulty
communication channels (both temporary and
permanent faults) does not exceed an upper bound
t during any round of execution of the task T_j.


Proof: Every task T_j is released when at
least (t + 1) processors produce the same output
and agree with each other. Thus, as long as there
exist no more than t faults in every round of the
execution of the task T_j, at least one of the
agreeing processors is fault-free, and hence T_j
is error-free when it is released. Q.E.D.


Theorem 2: The status of every processor 
or communication channel is identified correctly 
if a) the proposed algorithm is employed, b) the 
number of faulty processors and communication 
channels does not exceed t with respect to every 
round of the execution of each task, and c) every 
processor is in a connected subgraph of the sys- 
tem graph with at least t + 1 fault-free proces- 
sors which are connected to each other through 
some fault-free links. 


Proof: A processor is declared faulty if
it disagrees with at least t + 1 other processors
during the execution of a task T_j. As long as
the condition (b) holds true, a fault-free
processor will never be declared faulty. A faulty
processor will eventually be assigned a task T_j
to run for which it will produce an incorrect
output. Then, as long as the conditions (b) and
(c) hold true, the faulty processor will disagree
with at least t + 1 fault-free processors in a
connected subgraph of the system graph. Hence, it
will be declared faulty. A faulty communication
channel between two fault-free nodes is
automatically considered faulty because of their
disagreement with each other over that channel.
Q.E.D.


Corollary 2: The status of every processor
and communication channel is identified correctly
if the conditions (a) and (b) of Theorem 2 hold
true, the system graph is at least (t + 1)
connected, and n ≥ 2t + 1 (where n is the number
of processors in the system).


Proof: Deleting t faulty nodes and links
from the system graph will keep it still
connected with at least t + 1 fault-free nodes.
This satisfies the condition (c) of Theorem 2.
Q.E.D.


C. Discussion: 


Faults are of two types: permanent and
temporary. Permanent faults describe permanent
damage to the system components. Temporary faults
are further subdivided into two classes:
intermittent and transient. Intermittent faults
describe faults that are only occasionally
present due to unstable hardware and are caused
by factors such as loose connections, component
aging, poor design, chip contamination, etc.
Transient faults describe faults which are
present due to undesired environmental
disturbances such as radiation, humidity,
temperature variation, power supply fluctuation,
physical vibration, etc. Spillman [25] has
studied the nature of the temporary faults, their
detection techniques, and the modeling of their
behavior.

The proposed algorithm identifies the faulty
processors and communication channels without
specifying the type of the faults. But it is
important to be able to differentiate permanent
and intermittent faults from transient faults,
because if the faults are permanent or
intermittent, then the faulty components should
not be reused any longer. We propose the
following supplementary off-line and modified
on-line testing techniques.

Supplementary off-line testing can be
carried out as follows: Every time a processor
P_i is declared faulty with respect to a task T_j
by the algorithm, the task T_j must be recorded.
Then, at some later time when the system can run
T_j in a clean, undisturbed environment, the
processor P_i must be asked to run T_j again. If
it produces an incorrect output again, then it is
either permanently or intermittently faulty;
otherwise, the sources of failure are either
intermittent or transient faults.

The modified on-line testing technique is
based on the use of time redundancy. That is, on
the occasions when some processors disagree with
each other in executing a particular task T_j,
before expanding the disagreement graph by asking
some more processors to execute T_j, the same
processors should be asked to run T_j again. If
the same disagreement graph is obtained, then
expand the disagreement graph with respect to T_j
by asking more processors to join in executing
T_j. Otherwise, do not expand the disagreement
graph and ask the same group of processors to run
T_j again.
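
The rerun-before-expand rule can be sketched as a small control loop. The callables `run_task` and `expand_group`, the bound `max_reruns`, and the returned status strings are assumptions made for illustration; the text does not specify how many reruns are allowed.

```python
def resolve_with_time_redundancy(run_task, expand_group, max_reruns=3):
    """Rerun a task on the same processors before enlarging the group.

    run_task()     : executes the task once and returns the disagreement
                     graph as any comparable value; an empty value means
                     that all processors agreed (assumed interface).
    expand_group() : asks more processors to join (assumed interface).
    Returns "agreed", "expanded" or "gave_up".
    """
    previous = run_task()
    if not previous:
        return "agreed"
    for _ in range(max_reruns):
        current = run_task()
        if not current:
            return "agreed"        # the earlier disagreement did not recur
        if current == previous:
            expand_group()         # same graph twice: treat it as persistent
            return "expanded"
        previous = current         # graph changed: rerun on the same group
    return "gave_up"
```

A disagreement that disappears on the rerun is thus handled without disturbing other groups, which is the point of using time redundancy first.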

IV. Conclusion 

A fault-tolerant scheduling algorithm for 
error-free execution of the reliability critical 
programs was proposed. In that algorithm every 
program was assigned to a group of neighboring 
processors for execution. The conditions under 
which programs are executed error-free were 
given. A new concept called group maximum 
matching was introduced. This concept was used 
to maximize the system performance, i.e., to
maximize the number of concurrent groups or pro- 
grams running in the system. A heuristic algo- 
rithm for finding a group maximum matching was 
given. It is important to notice that every 
system is fault-free most of the time; hence, its 
average performance is dominated by its fault- 
free performance. The proposed group maximum 
matching attempts to maximize the system perfor- 
mance. 

The proposed fault-tolerant scheduling
algorithm is geared toward fault-free execution
of the tasks while it attempts to achieve on-line
fault-diagnosis of the faulty processors or
interprocessor communication channels as the
system runs its normal user programs. This is an
interesting feature, and it frees the system
designers from the trouble of writing diagnostic
programs which can detect all kinds of faults in
the processors in an acceptable amount of time.

THE RESILIENCY TRIPLE IN MULTIPROCESSOR SYSTEMS

Abstract: A multiprocessor system is represented by an
architecture graph G, where the nodes correspond to 
processors (or computers) and the edges represent 
communication links among them. A job executed on a system 
G is represented by a computation graph H, which is a 
subgraph of G, where the nodes correspond to one or more 
tasks assigned to a particular processor and the edges represent 
communications among tasks that are allocated on different 
processors. In this paper we define three important 
parameters, multiplicity, robustness, and configurability, called 
the resiliency triple, pertinent to the fault tolerance in 
multiprocessor systems. We will discuss how each parameter 
is related to fault tolerance and fault recovery and how it is 
determined for a given G and H. We present solutions for H 
being a path and G being either a hypercube or a mesh. 


Key Words: Fault tolerance, multiplicity, robustness, 
configurability, multiprocessor systems, hypercube, mesh. 


1. INTRODUCTION 


The proliferation of ever more powerful and complex 
multiprocessor systems has made fault tolerance a necessity in 
today's computer design. Although a large amount of related 
research work has been reported in the literature, and 
considerable efforts are still being made by many researchers to 
perfect the art of multiprocessor fault tolerance, there is very 
little analytical work done in the area of fault recovery. The 
existing research on fault recovery is rather fragmented and 
application specific. Moreover, techniques requiring imposed 
component redundancy have been widely proposed, while the 
intrinsic redundancy associated with multiprocessor systems 
has been overlooked. Since mapping application programs 
precisely onto a system architecture is very difficult, a 
multiprocessor system is often not fully utilized at all times. 
This means that some processors in the system are often left 
idle at one time or another. This inherent component 
redundancy should enable fault recovery to be achieved by 
mapping programs around faulty processors. Recently, Harary 
and Malek have developed a graph theoretic framework for 
fault recovery in multiprocessor systems [1]. In their work, 
existing graph theoretic models for system architecture and 
program structure are referred to as the architecture graph (G) 
and the computation graph (H) respectively, and are used to 
formalize the studies of fault recovery. Several parameters that 
affect the effectiveness of a fault recovery technique in various 
ways are introduced to allow easier comparison of different 
methodologies and to quantify the optimization of fault 
recovery. Also introduced in their work is a set of three 
parameters called the resiliency triple. These include the 
multiplicity, the robustness, and the configurability, 
collectively denoted by (m, r, c). These parameters play an 
important role in the better utilization of a multiprocessor 
system, its resiliency to faults, and its suitability for various 
fault recovery strategies. This paper presents methods
developed to determine these parameters for an important
computation graph, the path (pipeline), on two well known
architecture graphs: the hypercube and the mesh. Without loss
of generality, we shall assume an s×s square mesh and denote
it by M_s, where s is the number of nodes on each side. We will
also use Q_n to denote an n-dimensional hypercube and P_k to
denote a path that consists of k nodes.


In the next section, the fault recovery model is briefly 
described. The resiliency triple is defined in Section 3, and its 
impact on task allocation and fault recovery is discussed. 
Section 4 presents the methods for determining each parameter 
in the resiliency triple for a pipeline computation structure (H) 
on two important classes of multiprocessor systems: the 
hypercube and the mesh (G). Section 5 gives the conclusions. 


2. THE FAULT RECOVERY MODEL 


In order to show how the parameters in the resiliency triple 
are related to fault recovery, we would like to briefly introduce 
the fault recovery model proposed by Harary and Malek [1]. 


In general, fault recovery models can be used in system 
synthesis or analysis. The synthesis involves the construction 
of an appropriate architecture with redundant components in 
order to meet a set of required conditions. An excellent example 
of the synthesis for fault recovery of cycles and binary trees 
can be found in [2]. In the analysis, a prescribed architecture
graph such as a mesh or a hypercube is given and all fault
recovery measures must be taken within this framework. In
order to take advantage of the inherent component redundancy
offered by a multiprocessor system, we decided to concentrate
on the latter, which requires no imposed hardware redundancy.
The fault recovery model is described as follows.


Let an architecture graph G represent the physical architecture 
of a multiprocessor system. Nodes in this graph represent 
processors (or computers) and interface communication 
modules while edges indicate the actual point-to-point 
communication links. Each node in this graph can be extended 
to include memory, input/output channels, and other devices. 
Fig. 1 shows the architecture graph of an 8-processor 
hypercube. Let a computation graph H represent an actual 
computation (job) where each node corresponds to a task and 
each edge indicates the inter-task communications. The dark 
line in Fig. 1a shows a computation graph of a 4-node path 
mapped onto an architecture graph of an 8-processor 
hypercube. Since the computation graph has to be mapped onto 
the architecture graph, H is a subgraph of G. A faulty link 
leads to the removal of an edge from G and a faulty processor 
results in the removal of a node and the incident edges. When 
one or both of these cases occur, there are two possibilities: 
Either the resulting graph G' contains another subgraph H' that 
is isomorphic to H or it does not. If it does not, then the system 
G is called non-recoverable with respect to H and the particular 
fault(s). On the other hand, when G' does contain a subgraph
H' isomorphic to H, and there are two or more such
subgraphs, then the one yielding the minimum cost (such as
some function of distance, time, or other parameters introduced
in [1]) will result in the most efficient fault recovery. Fig. 1
shows the recovery of a job on a hypercube system.
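
For the case where H is a path, recoverability after node faults can be illustrated with a brute-force search. This is only a sketch: the helper names `hypercube` and `find_path` are hypothetical, faulty links and the cost function of [1] are not modeled, and the search is exponential in general, so it is meant for small graphs such as an 8-processor hypercube.

```python
def hypercube(n):
    """Adjacency of the n-dimensional hypercube; nodes are 0 .. 2**n - 1."""
    return {u: {u ^ (1 << b) for b in range(n)} for u in range(1 << n)}


def find_path(adj, k, faulty=frozenset()):
    """Return k distinct fault-free nodes forming a path, or None.

    adj    : dict node -> set of neighbours (the architecture graph G).
    k      : number of nodes of the path H.
    faulty : nodes removed from G by processor faults.
    """
    def extend(path, used):
        if len(path) == k:
            return path
        for v in sorted(adj[path[-1]]):
            if v not in used and v not in faulty:
                found = extend(path + [v], used | {v})
                if found:
                    return found
        return None

    for start in sorted(adj):
        if start not in faulty:
            found = extend([start], {start})
            if found:
                return found
    return None
```

Calling `find_path(hypercube(3), 4, faulty={0})` returns a 4-node path that avoids node 0, while asking for an 8-node path with one faulty node returns `None`, the non-recoverable case.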


Fig. 1. The recovery of P_4 on a hypercube. (Figure label: faulty node.)


We observe that in the general case a computation graph H is a
digraph representing a task graph and a one-to-one mapping of H
onto G may not be possible. In such cases we resort to a 
concept of dilation which allows mapping of nodes that are 
adjacent in H onto G in such a way that the distance between 
nodes in G corresponding to adjacent nodes in H is equal to or 
longer than that in H. The general recoverability problem is 
NP-complete. In this paper we restrict G to be a mesh or a 
hypercube and H to be a path (pipeline). 


3. THE RESILIENCY TRIPLE 


The resiliency triple (m, r, c) consists of the following three
parameters: multiplicity (m), robustness (r), and configurability (c).


Fig. 2. The multiplicity, robustness, and configurability
for P_4 on Q_3: (a) m=2; (b) r=8.


Multiplicity (Fig. 2a) is the maximum number of 
node-disjoint embeddings of H onto G, denoted by m(G, H). 


In graph theory [3], this is known as the node-disjoint packing 
number pac,(G, H) as introduced in [4]. When m=2, two 


identical jobs can be run simultaneously on the system (with 
some additional hardware) to allow single fault detection. 
When m>2, any single fault can be masked by voting the 
outputs of multiple copies of the same job. In other words, it 
allows N Modular Redundancy (NMR) in space. Usually, we 
are only interested in knowing whether m is equal to or greater 
than a chosen number in the range of 2 to 9. Multiplicity is 
also an indication of a system's fault tolerance. Given the 
necessary hardware, a system can be (m-1)-fault-tolerant. 
Higher multiplicity also allows more homogeneous jobs to be 
run on the system simultaneously to achieve better system 
utilization. Furthermore, testing can be performed by comparing
results of the same job executed on different subsets of 
processors. 


Robustness, denoted by r(G, H), is the number of
embeddings of a graph H onto a labeled graph G such that each
node of H is at a different label of G in each embedding (Fig.
2b). When r>1, fault recovery can be achieved through time
redundancy by executing each stage of the computation 
(systolic array or pipeline) on two or more different processors 
at a time [5]. This corresponds to duplex or NMR in time. 
Again, we are usually concerned about whether r is equal to or 
bigger than a chosen number within the range of 2 to 9. Since 
multiplicity and robustness correspond to redundancy in space 
and time, they are also useful in system diagnosis. 


Fig. 3. Equivalent configurations (fixed-labeling): (a) (1, 2, 1); (b) (2, 3, 2); (c) (3, 2, 3).


Configurability (Fig. 2c) is the number of ways in which a 
particular job H can be configured on a system G. If H is a 
proper subgraph of G, there may be many ways to map H onto 
G. Each particular mapping, represented by a graph H_c (c is a
positive integer), is called a configuration. Since all
configurations of H on G are isomorphic, the computation at 
hand can be performed using any of them. However, some of 
these isomorphic graphs are equivalent. Although isomorphism 
among a collection of configurations is itself an equivalence 
relation, we have, for reasons that will become obvious later, 
defined equivalence in a stricter sense. If all configurations are 
considered as "rigid" graphs, then two isomorphic 
configurations may not possess the same properties such as 
dimensionality and space occupancy. Two configurations H_i
and H_j are equivalent if, after some necessary rotation and/or
translation, H_i either coincides with, or becomes a mirror
image of, H_j. Fig. 3 shows some equivalent configurations of a
4-node path P_4 on an 8-node hypercube Q_3. The number of
non-equivalent configurations of H on G is defined as the 
configurability of H on G and is denoted by c(G, H). Notice 
that each set of equivalent configurations is counted as one in 
deriving the configurability of a given computation graph on an 
architecture graph. The parameter, configurability, is a measure 
of several aspects of a multiprocessor system. Higher
configurability generally allows a better system utilization, a
greater resiliency to faults, and a higher efficiency in fault 
recovery. The following examples demonstrate the importance 
of this parameter in a fault-tolerant multiprocessor system. 


Example 1: 


Consider the computation graph H to be a 5-node path P_5 and
the architecture graph G to be a 16-node mesh M_4. Fig. 4a
shows two nonequivalent configurations (H_1 and H_2) of H on
G. If these are the only configurations available, then at most
two copies of H can be run simultaneously on the system G, as
shown in Figs. 4a, 4b, and 4c. However, if we include the
configuration H_3 shown in Fig. 4d, then three copies of H can
be mapped onto G as depicted. This means that either more 
jobs can be scheduled on the system to achieve better utilization 
or more copies of the same job can be run concurrently to 
obtain a higher degree of fault tolerance. In the above example, 
configurations shown in Figs. 4a, 4b, and 4c allow two copies 
of the same job to be run simultaneously so that any single fault 
can be detected. However, the situation shown in Fig. 4d 
allows three copies of the job to be executed simultaneously 
and therefore, any single fault can be masked. Observably, 
higher configurability results in higher multiplicity which 
allows the system to be more efficiently utilized. 


Fig. 4. Mappings of P_5 on M_4, panels (a) through (d).


Example 2: 


Consider the same computation and architecture graphs used 
in Example 1. If the two columns of processors on the left half 
of the system G are unavailable and the job H must be 
scheduled on the remaining processors, then H may be mapped 
on G as shown in Fig. 5a. Since the left half of the system is 
busy, fault recovery must be accomplished by reconfiguring 
the job around faulty components while using only the right 
half of the system. Fortunately, due to the existence of various 
configurations, the job H can be reconfigured to bypass any 
single faulty node. Figs. 5b through 5f show the possible 
configurations when the faulty nodes are as indicated. Clearly, 
higher configurability signifies a bigger chance of successful 
job reconfiguration and is thus an indication of the system's 
greater resiliency to faults. 


Example 3: 


Consider the same computation graph H (in previous
examples) embedded onto the architecture graphs G_1 and G_2,
which are a 3- and a 4-dimensional hypercube, respectively.
The configurability of H on G_1 is two (c(G_1, H) = 2) as shown
in Figs. 6a and 6b, and that of H on G_2 is three (c(G_2, H) = 3)
as shown in Figs. 6a, 6b, and 6c. In Fig. 6a, if node A is
faulty on G_1, then the task executed there can be transferred to
node B. All other nodes can remain stationary for the job to
continue. Any other reconfiguration will result in the
disturbance of more nodes. However, due to the existence of
an additional configuration of H on G_2, if this node is faulty on
G_2, then the task assigned to that node can be transferred to
either node B or node C, depending on which node is available
at the time. This example shows that higher configurability also
indicates a more flexible system for reconfiguration.


Fig. 5. Reconfiguration of P_5 on M_4 with different faulty nodes, panels (a) through (f). Legend: unavailable processors; faulty node.


Fig. 6. Non-equivalent configurations of P_5 on Q_3 and Q_4.


Example 4: 


Consider the same computation graph H and architecture
graphs G_1 and G_2 used in Example 3. If node A in Fig. 6b
becomes faulty on G_1, then not only does node A have to be
transferred (to node B), but node D has to be moved (to node
E) also in order to maintain the 5-node pipeline. However, if
this node becomes faulty on G_2, only that node has to be
moved (to node F) to complete the reconfiguration. Therefore,
higher configurability may also increase the efficiency of fault 
recovery. 


Since each parameter in the resiliency triple has some impact 
on the fault tolerance of a multiprocessor system, the study of 
these parameters is not merely of theoretical interest, but it is 
also useful in solving practical problems. Next, we present the 
methods developed for determining each of these parameters 
for a computation graph H on an architecture graph G. 


4. DETERMINING THE RESILIENCY TRIPLE 


The resiliency triple is dependent on both the computation 
and the architecture graphs. Since the pipeline is a very widely 
used computation structure, we have decided to start with the 
pipeline (path) as the computation graph under consideration. 
The binary n-cube (hypercube) and the mesh have both 
received much research and commercial attention and are useful 
for a wide range of problems. We therefore choose these two 
systems as the architecture graphs. 


4.1. Multiplicity 


A systematic way to map multiple node-disjoint copies of a
path P_k on a mesh M_s or a hypercube Q_n is to concatenate as
many P_k's as possible along a hamiltonian path. If a path P_6
is mapped onto a mesh M_5 in such a manner, four
node-disjoint copies will result. The multiplicity is, therefore,
equal to four. Since a hamiltonian path exists in a hypercube or
a mesh of any size, the multiplicity m(G, H), where H
represents a path computation graph P_k and G represents either
a mesh M_s or a hypercube Q_n system, is given by the following
expression.

$$m(G, H) = \lfloor N/k \rfloor$$

Notice that $\lfloor x \rfloor$ is the largest integer smaller than or equal to
x, N is the number of processors in the system, and k is the
number of tasks in the computation (pipeline). $N = 2^n$ for Q_n
and $N = s^2$ for M_s. For a P_6 on an M_5, N=25 and k=6.
Therefore, $m = \lfloor 25/6 \rfloor = 4$. Furthermore, we may easily
generalize and observe that for any architecture graph G with
N nodes that has a hamiltonian path, the multiplicity is given
by $m(G, P_k) = \lfloor N/k \rfloor$.
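
The construction above can be illustrated with a minimal Python sketch. It assumes a row-by-row "snake" ordering as the hamiltonian path of the mesh; the function names `snake_path` and `disjoint_copies` are hypothetical and are not part of the original method description.

```python
def snake_path(s):
    """One hamiltonian path of an s x s mesh (assumed snake ordering):
    even rows are walked left to right, odd rows right to left."""
    path = []
    for row in range(s):
        cols = range(s) if row % 2 == 0 else range(s - 1, -1, -1)
        path.extend((row, col) for col in cols)
    return path


def disjoint_copies(ham_path, k):
    """Cut a hamiltonian path into floor(N/k) consecutive k-node segments.
    Each segment is one node-disjoint copy of the pipeline P_k."""
    m = len(ham_path) // k
    return [ham_path[i * k:(i + 1) * k] for i in range(m)]


# P_6 on M_5: N = 25, k = 6
copies = disjoint_copies(snake_path(5), 6)
print(len(copies))  # 4
```

Any other hamiltonian path of G could be substituted for the snake ordering; the number of copies depends only on N and k.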
4.2. Robustness 


The maximum number of mappings of a path P_k on a mesh
M_s or a hypercube Q_n, such that each node in P_k is assigned
to a different node in M_s or Q_n in each mapping, can be obtained in the
following manner. Starting with a mapping H_1 (H_1 = P_k), the
next mapping H_2 can be obtained by moving each node in H_1
to an adjacent node in the same direction along a hamiltonian
cycle. The different mappings, H_i's (1 ≤ i ≤ r, where r is the
robustness), are obtained by sliding P_k along a hamiltonian
cycle in M_s or Q_n, one node at a time, until we return to the
original mapping, H_1. Fig. 2b shows such mappings of P_4 on
Q_3. We observe that robustness is equal to the number of
nodes N in any system graph G if a hamiltonian cycle exists.
Since a hamiltonian cycle exists in all hypercubes and all
meshes with an even number of nodes, the robustness is
given by r(G, H) = N for a P_k on a Q_n or an M_s where s is
even. However, if there are an odd number of nodes in a
mesh (s is odd), then a hamiltonian cycle does not exist. In
this case, we can choose one of the following alternatives,
which are easy to implement.


(1) Find the largest cycle in M_s and use it to generate the
mappings as described above.


(2) Slide P_k along a hamiltonian path, one node at a time,
from one end to another.


If we choose option (2), then robustness will be given by
r(G, H) = N-k+1. Clearly, if we can find a cycle in M_s
such that N' > N-k+1, where N' is the number of nodes in this
cycle, then option (1) will give a better result.


We may observe that the largest cycle in M_s with N nodes,
where s is odd, contains N-1 nodes. Fig. 7 shows how such a
cycle is constructed on an arbitrary M_s (s is odd).


Consequently, using option (1), the robustness is given by
r(G, H) = N-1 for P_k on M_s when s is odd. This is
optimum (for k > 2) since a hamiltonian cycle does not exist.
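
The sliding procedure for the hypercube case can be sketched as follows. The sketch assumes the reflected Gray code as the hamiltonian cycle of Q_n (any hamiltonian cycle would do); `gray_cycle` and `sliding_mappings` are hypothetical names chosen for illustration.

```python
def gray_cycle(n):
    """Reflected Gray code order of the 2**n node addresses of Q_n.
    Consecutive addresses (and the last and first) differ in one bit,
    so for n >= 2 the sequence is a hamiltonian cycle."""
    return [i ^ (i >> 1) for i in range(2 ** n)]


def sliding_mappings(cycle, k):
    """Mapping j places task t of the pipeline on cycle[(j + t) % N].
    Sliding one node at a time around the cycle yields N mappings."""
    n_nodes = len(cycle)
    return [[cycle[(j + t) % n_nodes] for t in range(k)]
            for j in range(n_nodes)]


# P_4 on Q_3: every task visits all 8 processors, so r = 8
mappings = sliding_mappings(gray_cycle(3), 4)
print(len(mappings))  # 8
```

With an (N-1)-node cycle on an odd mesh, the same `sliding_mappings` call would yield N-1 mappings, which matches option (1).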


Fig. 7. The largest cycle on M_s when s is odd.


4.3. Configurability 
4.3.1. Configurability of a Path on a Hypercube 


In order to determine configurability, we need to generate
various non-equivalent configurations. Before presenting an
algorithm to accomplish this, we shall describe a scheme
which is suitable to represent a configuration of a path P_k on a
hypercube Q_n. Since P_k has k-1 edges, a logical way to
represent the path is by a vector of k-1 positive integers. Each
integer indicates the dimension in which the corresponding 
edge resides. Since non-equivalent configurations are not 
distinguished by their positions or orientations in the system, 
and the hypercube is a symmetric graph, we need not adopt a 
fixed coordinate system. Furthermore, it is more 
advantageous not to assign a fixed integer to each dimension. 
This can be demonstrated by the following example. 


Example 5: 


In a Q3 system, if the dimensions are labeled such that the 


horizontal dimension (x-coordinate) is denoted by 1, the 
vertical dimension (y-coordinate) is denoted by 2, and the 
remaining dimension (z-coordinate) is denoted by 3, then the 
configuration in Fig. 3a will be represented by the vector (1, 
2, 1), and those shown in Figs. 3b and 3c will be represented 
by the vectors (2, 3, 2) and (3, 2, 3) respectively (for easier 
reference, we have chosen the lower left node as the starting 
point of the path). Obviously, all three configurations are 
equivalent and should be counted as one. But many such 
equivalent configurations will be generated as different vectors 
if each dimension is assigned a fixed integer. However, if we 
label every dimension dynamically, according to the order in 
which they are traversed by the path, then all the above three 
configurations will be represented by the vector (1, 2, 1). Fig.
8 shows all the non-equivalent configurations of P_6 on Q_3
using this representation. Since there are four non-equivalent
configurations, the configurability of P_6 on Q_3, c(Q_3, P_6), is
equal to four.
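
As an illustration of dynamic labeling, the sketch below converts a path given as a sequence of hypercube node addresses into its vector. It assumes (this is not stated in the text) that nodes are addressed by integers whose bit 0, bit 1, and bit 2 correspond to the x, y, and z dimensions; `dynamic_vector` is a hypothetical helper name.

```python
def dynamic_vector(nodes):
    """Label each dimension by the order in which the path first
    traverses it, and return the resulting configuration vector."""
    labels = {}
    vector = []
    for a, b in zip(nodes, nodes[1:]):
        diff = a ^ b
        if diff == 0 or diff & (diff - 1):
            raise ValueError("consecutive nodes must differ in exactly one bit")
        if diff not in labels:
            labels[diff] = len(labels) + 1  # next unused label
        vector.append(labels[diff])
    return tuple(vector)


# The three paths of Fig. 3 under the assumed addressing:
fig_3a = [0b000, 0b001, 0b011, 0b010]  # fixed labels (1, 2, 1)
fig_3b = [0b000, 0b010, 0b110, 0b100]  # fixed labels (2, 3, 2)
fig_3c = [0b000, 0b100, 0b110, 0b010]  # fixed labels (3, 2, 3)
print([dynamic_vector(p) for p in (fig_3a, fig_3b, fig_3c)])
```

All three paths collapse to the same vector, which is why fixed labels over-count equivalent configurations.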


The configurability of H on G (P_k on Q_n here) can be
obtained by enumerating all the non-equivalent configurations
of H on G. But since the distinct configurations themselves 
are also very useful for task allocation and fault recovery, we 
want to generate and save the vectors which represent them. 
When there are many non-equivalent configurations of H on 
G, we may decide to save only a chosen number, say x, of 
them in order to save time and memory space. In this case, we 
are only interested in knowing the exact number when 
configurability is smaller than x. 


Fig. 8. Path representation using dynamic labeling.


Before presenting the algorithm for generating
non-equivalent configurations of P_k on Q_n, let us discuss
some related issues. Firstly, we observe that when k=2 or
k=3, there is only one configuration, represented by the
vectors (1) and (1, 2) respectively. As a result, to find a vector
that represents a configuration of P_k (k-1 edges) on Q_n,
where k>3, we only need to generate k-3 integers to be
appended to the vector (1, 2). Secondly, for any path mapped
on a hypercube, no two adjacent edges can lie in the same
dimension. This means that in the vector representing a
configuration of P_k on Q_n, adjacent integers cannot be equal.
Consequently, given an edge represented by an integer i, the
next edge in the path can only be represented by an integer j
such that 1 ≤ j ≤ n and j ≠ i. In other words, we can only choose
from n-1 integers to represent this edge. Having observed
this, it is clear that we may generate up to $(n-1)^{k-3}$ vectors to
be candidates for the configurations of P_k on Q_n. Many of
these vectors represent configurations that contain cycles and
must thus be eliminated. Other vectors may represent
equivalent configurations and must thus be counted as one.
Fig. 9 shows two configurations with cycles. Observably,
these configurations are also equivalent. If we have to test
each of the $(n-1)^{k-3}$ vectors for cycles and equivalence, the
$O((\log N)^{k-3})$ computing time may be excessive ($N = 2^n$ is the
number of processors in Q_n). For a typical case of n=10 and
k=10, the number of operations becomes $9^7 = 4{,}782{,}969$.
However, if we take another approach by building the
configurations of P_k from those of P_{k-1} on Q_n, then only
$2(n-1)\,c(Q_n, P_{k-1})$ vectors will be generated. This is because
we can build a path of k nodes by appending a node either to
the front or to the end of a (k-1)-node path. Knowing that the
configuration of P_3 is (1, 2), we can extend the path one edge
at a time until we reach P_k. As a result, only

$$\sum_{i=4}^{k} 2(n-1)\,c(Q_n, P_{i-1})$$

vectors need to be generated to obtain
all the configurations of P_k on Q_n. If we put an upper bound,
x, on the c(Q_n, P_{i-1})'s, O(k log N) computation time is
required for the whole operation. The latter approach is also
more efficient in terms of memory requirement: O(k) memory
space is required instead of $O((\log N)^{k-3})$, which is necessary
for the former approach.
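
The two candidate counts can be compared numerically with a short sketch. The cap x = 10 used in the example is an arbitrary illustrative value, not one taken from the text, and both function names are hypothetical.

```python
def brute_force_candidates(n, k):
    """Number of vectors when k-3 integers are appended to (1, 2)
    and each may take any of n-1 values."""
    return (n - 1) ** (k - 3)


def incremental_candidates_bound(n, k, x):
    """Upper bound on sum_{i=4}^{k} 2(n-1) c(Q_n, P_{i-1}) when every
    configurability in the sum is capped at x."""
    return sum(2 * (n - 1) * x for _ in range(4, k + 1))


print(brute_force_candidates(10, 10))            # 4782969
print(incremental_candidates_bound(10, 10, 10))  # 1260
```

The bound grows linearly in k and in n = log N once x is fixed, which is consistent with the O(k log N) estimate.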


Fig. 9. Configurations with cycles on a Q_3: (a) (1, 2, 3, 2, 3); (b) (1, 2, 1, 2, 3).


After generating the configuration vectors mentioned above,
we need to perform the eligibility test. This consists of testing
for the existence of cycles and equivalent configurations. In
order to detect cycles, let us observe that any cycle on Q_n is
represented by a vector in which every integer appears an
even number of times. Any vector that corresponds to a
configuration which contains a cycle must have a subvector
that exhibits the above characteristic. Let us scan a vector from
left to right and keep an n-bit binary number as a parity
indicator, in which the i-th bit (b_i) indicates whether the
integer i has appeared an even number of times. If it has,
b_i is set to 0; otherwise b_i is set to 1. For example, if we
consider the vector (1, 2, 1, 2, 3) in Fig. 9b, the 3-bit parity
indicator (P = b_1 b_2 b_3) will be updated as follows when we
scan the vector from left to right one integer at a time:


step 1: 100 (1 appeared once)
step 2: 110 (2 appeared once)
step 3: 010 (1 appeared twice)
step 4: 000 (2 appeared twice)


After scanning the fourth integer in the vector, the parity
indicator becomes zero, indicating the detection of a cycle.
Notice that if the first edge in the configuration is part of a
cycle, as shown in Fig. 9b, then the parity indicator would
become zero as soon as a cycle is detected, even though there
is still another edge attached to the cycle. In this case, there is
no need to scan the rest of the vector. However, if the first
edge is not part of the cycle, as shown in Fig. 9a, then the
parity indicator would not go to zero if the vector is scanned
as a whole. In this case, the cycle will be detected when the
subvector (2, 3, 2, 3) is scanned. Thus, cycle detection
requires scanning the configuration vector and its subvectors
while updating and checking the parity indicator. This requires
$O(k^2)$ computations for P_k (O(k) if all the subvectors are
checked in parallel).
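
The parity scan described above can be written as a minimal sketch. The parity indicator is held in an integer whose bit i-1 stands for b_i; `parity_trace` and `has_cycle` are hypothetical names.

```python
def parity_trace(vector, n):
    """Parity indicator b_1 b_2 ... b_n after each integer is scanned."""
    parity = 0
    trace = []
    for label in vector:
        parity ^= 1 << (label - 1)  # flip b_label
        trace.append("".join("1" if parity >> i & 1 else "0"
                             for i in range(n)))
    return trace


def has_cycle(vector):
    """True if some contiguous subvector drives the indicator to zero.
    Every starting position is tried, hence O(k^2) updates in total."""
    for start in range(len(vector)):
        parity = 0
        for label in vector[start:]:
            parity ^= 1 << (label - 1)
            if parity == 0:
                return True
    return False


print(parity_trace((1, 2, 1, 2), 3))   # ['100', '110', '010', '000']
print(has_cycle((1, 2, 3, 2, 3)))      # True, via the subvector (2, 3, 2, 3)
```

Two equal adjacent integers also drive the indicator to zero, so the same test rejects vectors that repeat a dimension on consecutive edges.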


The way to test for equivalence is by observing that any
configuration can be traced from either end. Consider the
configuration in Fig. 9a. If the configuration is traced in the
order ABCDEB, then we get the vector (1, 2, 3, 2, 3). But if
we traverse in the opposite direction starting with node B,
then the vector obtained would be (1, 2, 1, 2, 3). How do we
derive this vector from (1, 2, 3, 2, 3) and thus detect the
equivalence? The answer is by inverting the vector (i.e. listing
a given vector by starting with the last element) and
renumbering the resulting vector as follows.


H_1 = (1, 2, 3, 2, 3)
H_1' = INVERT(H_1) = (3, 2, 3, 2, 1)
H_2 = RENUMBER(H_1') = (1, 2, 1, 2, 3)

∴ H_1 ≡ H_2


Observably, the operation RENUMBER performs the
following mapping:


3 → 1, 2 → 2, 1 → 3.

However, it may perform a different mapping in a different
situation. Its job is to scan the current vector and relabel each
integer according to the order in which it appears in the vector.
In H_1', the integer 3 is the first label to appear in the vector,
and is thus given a new label 1. Similarly, the integer 2 is the
second label and 1 the third to appear in H_1'; they are
therefore reassigned the new labels 2 and 3 respectively. Since
we have not assigned a fixed number to any particular
dimension in Q_n, inverting and then renumbering a vector
would not produce a new configuration, and is equivalent to
tracing the path from the opposite end. Renumbering is also
necessary when a configuration vector for P_k is generated by
appending an integer to the front of a vector of P_{k-1}. For
example, if we want to obtain a configuration for P_6 by
appending an edge to the front of the P_5 shown in Fig. 10a,
we can append an integer (either 2 or 3 in this case) to the
front of the vector that represents the P_5. If we choose to
append a 2, the following vector is obtained: H'(P_6) = (2, 1,
2, 3, 2). The corresponding configuration is shown in Fig.
10b. H'(P_6) needs to be renumbered as follows:


2 → 1, 1 → 2, 3 → 3
H(P_6) = RENUMBER(H'(P_6)) = (1, 2, 1, 3, 1).


Fig. 10. Extension of a 5-node path: (a) H(P_5) = (1, 2, 3, 2); (b) H(P_6) = (1, 2, 1, 3, 1).
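
The INVERT and RENUMBER operations can be sketched in a few lines. The lower-case function names and the `equivalent` helper are illustrative choices, not identifiers from the text.

```python
def invert(vector):
    """List the vector starting with its last element."""
    return tuple(reversed(vector))


def renumber(vector):
    """Relabel each integer by the order in which it first appears."""
    labels = {}
    out = []
    for d in vector:
        if d not in labels:
            labels[d] = len(labels) + 1
        out.append(labels[d])
    return tuple(out)


def equivalent(u, v):
    """Two vectors describe the same configuration if they agree after
    renumbering, read from either end."""
    return renumber(v) in (renumber(u), renumber(invert(u)))


print(renumber(invert((1, 2, 3, 2, 3))))  # (1, 2, 1, 2, 3)
print(renumber((2, 1, 2, 3, 2)))          # (1, 2, 1, 3, 1)
```

The first call reproduces H_2 from H_1, and the second reproduces H(P_6) from H'(P_6) in the front-extension example.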


Having discussed the various issues involved in generating
the configurations of P_k on Q_n, we are now ready to present
the algorithm which enumerates up to x non-equivalent
configurations of P_k on Q_n. The final configuration vectors
are stored in an x by (k-1) array, H(1:x, 1:k-1), which
contains up to x vectors of k-1 integers, each of which
represents a unique configuration. Then H(m, 1:k-1) would
correspond to the (k-1)-integer vector representing the m-th
configuration enumerated. An array T(1:x, 1:k-1) is used to
save the intermediary vectors (for the P_{i-1}'s). The algorithm is as
follows:


Algorithm 1:


Input: n, k, x.
1. If k=2, exit with H(1)=(1), c=1.
2. If k=3, exit with H(1)=(1, 2), c=1.
3. If k>3, set H(1,1)=1 and H(1,2)=2; h=2.
   \ h keeps track of the largest integer in the vector and h ≤ n \
4. Until H contains vectors of k-1 elements, set T = H, erase
   H, and do the following:
   A. For every vector v in T, do the following:
      a. If h ≠ n, set h=h+1.
      b. For every integer i such that 1 ≤ i ≤ h and i ≠ j, do the
         following: \ j is the last integer in v \
         i. Append i to the end of v.
         ii. Perform cycle detection. If positive, go to 4b.
         iii. Invert and renumber the vector; check if the resulting
              vector has been saved in H. If positive, go to 4b.
         iv. Save the resulting vector in H. If |H| = x, go to 4;
             otherwise go to 4b.
             \ |H| is the number of vectors in H \
      c. For every integer i such that 2 ≤ i ≤ h, do the following:
         i. Append i to the front of v and renumber the vector.
         ii. Check if vector exists in H. If positive, go to 4c.
         iii. Perform cycle detection. If positive, go to 4c.
         iv. Invert and renumber the vector; check if it exists in H.
             If positive, go to 4c.
         v. Save the resulting vector in H. If |H| = x, go to 4;
            otherwise go to 4c.
5. If |H| < x, set c = |H| and output("c is equal to", c);
   otherwise, output("c is at least", x).


In the above algorithm, Step 4 is executed k-3 times. For
each iteration of Step 4, Step 4A is invoked at most x (a
constant) times, each of which causes Steps 4b and 4c to be
performed h times. Steps 4b and 4c perform cycle detection
and vector renumbering, which require $O(k^2)$ computing time.
Consequently, the total time requirement for Algorithm 1 is
$O(k^3 h)$. By observing that h is O(k) if k ≤ n+1 and is O(n) if
k > n+1, we conclude that the computation time is either $O(k^4)$
or $O(k^3 \log N)$. If we assume an upper bound on the size of the
pipeline so that k ≤ K, where K is a chosen constant, then the
computation can be accomplished either in a constant (O(1))
time or O(log N) time, depending on the values of k and n.
Since two x by (k-1) arrays are used to store the final and the
intermediary results, the memory requirement for Algorithm 1
is O(k). Again, for k ≤ K, this means O(1) storage space.
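
A simplified Python sketch of the incremental enumeration follows. It is not a line-by-line transcription of Algorithm 1: it omits the cap x, keeps one canonical vector per equivalence class in a set instead of the arrays H and T, and recomputes h per vector. All function names are hypothetical.

```python
def renumber(vector):
    """Relabel dimensions in order of first appearance."""
    labels = {}
    return tuple(labels.setdefault(d, len(labels) + 1) for d in vector)


def has_cycle(vector):
    """True if some contiguous subvector uses every label an even
    number of times (parity indicator returns to zero)."""
    for start in range(len(vector)):
        parity = 0
        for label in vector[start:]:
            parity ^= 1 << (label - 1)
            if parity == 0:
                return True
    return False


def canonical(vector):
    """Smaller of the two renumbered readings (traced from either end)."""
    return min(renumber(vector), renumber(vector[::-1]))


def hypercube_configurations(n, k):
    """Sorted non-equivalent configuration vectors of P_k on Q_n."""
    if k < 2:
        raise ValueError("a path needs at least two nodes")
    if k == 2:
        return [(1,)] if n >= 1 else []
    if n < 2:
        return []
    current = {(1, 2)}                      # the only configuration of P_3
    for _ in range(k - 3):                  # add one edge per pass
        extended = set()
        for v in current:
            h = min(max(v) + 1, n)          # allow at most one new dimension
            for i in range(1, h + 1):
                for candidate in (v + (i,), (i,) + v):  # end, then front
                    if not has_cycle(candidate):
                        extended.add(canonical(candidate))
        current = extended
    return sorted(current)


print(len(hypercube_configurations(3, 5)))  # 2
print(len(hypercube_configurations(4, 5)))  # 3
print(len(hypercube_configurations(3, 6)))  # 4
```

The three printed values agree with c(G_1, H) = 2 and c(G_2, H) = 3 in Example 3 and with c(Q_3, P_6) = 4 in Example 5.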


4.3.2. Configurability of a Path on a Mesh 


Before discussing the method of representation for a path P_k
on a square mesh system (M_s), it is helpful to observe the
following differences between a square mesh and a hypercube
(Q_n):


1. The Q_n is a regular graph in which every node has the same
degree, n. But the M_s has four corner nodes, which are of
degree 2, and 4s - 8 boundary nodes, which are of degree 3.
The remaining nodes all have a degree of 4.


2. A configuration of P_k in any particular position on Q_n can be
rotated n-1 times before returning to its original orientation. A
configuration of P_k on M_s has only three rotations besides
itself. It also has two mirror images, one along the x-axis
(horizontal) and the other along the y-axis (vertical).


We have decided to ignore the boundary cases on M_s in order
to simplify the discussion. This requires that k ≤ s, which is
usually satisfied. However, if k > s, then the number of edges
traversed in each direction must be counted so as not to exceed
s. Since each node under consideration has a degree of 4
regardless of s, and given an edge in a path, the next edge to be
traversed can only be oriented in one of three directions (two
adjacent edges cannot be traversed in opposite directions on a
mesh), we have chosen to assign a fixed integer to each of the
four directions, as shown in Fig. 11a. Then a configuration of
P_k on M_s can be represented by a (k-1)-element vector as
shown in Fig. 11b. We may require that the first integer be 1
(first edge always heads to the right) for easier reference. When
k=2, there is only one configuration, which is represented by
(1). When k>2, we need to find an additional k-2 integers to
complete the (k-1)-edge path. Because each integer can assume
one of three values as mentioned earlier, up to $3^{k-2}$ vectors may
be generated. These include many cycle-bearing or equivalent
configurations. For a typical case of k=10, $3^8 = 6561$ operations
are required. Although this may be acceptable, we can improve
the time efficiency by using a method similar to the one
described in Section 4.3.1. A configuration for P_k can be
obtained from that of P_2 by appending an edge to either end of
the latter, step by step until k-1 edges are accumulated. To
obtain P_i from P_{i-1}, $6\,c(M_s, P_{i-1})$ vectors are generated. Thus,

$$\sum_{i=3}^{k} 6\,c(M_s, P_{i-1})$$

operations are required for the complete
process. If we again put an upper bound on the number of
non-equivalent configurations for each P_i, where 2 ≤ i < k, then
O(k) computing time is needed (O(1) if k ≤ K).


Fig. 11. Representation of a path and a cycle on a mesh: (a) labeling each direction; (b) a path, (1, 3, 3, 1, 2, 1, 2, 4), traced from the node marked Start; (c) a cycle, V = (1, 1, 1, 2, 4, 2, 4, 3, 4, 3).


The eligibility test for configurations of P_k on M_s also
consists of cycle detection and equivalence test. However, the
particular methods to accomplish these are different from those
presented in Section 4.3.1. The following theorem can be used
for cycle detection.


Theorem 1: Given the labeling scheme in Fig. 11a and a
vector v representing a configuration on a mesh M_s, if we let
sum denote the sum of all the integers in v and length denote
the number of integers in v, then the configuration is a cycle iff
sum = 2.5 x length.


Proof: If a cycle on a mesh is traversed starting from an
arbitrary node, each edge in the cycle would belong to a pair of
edges pointing in opposite directions. A cycle of length edges
consists of length/2 such pairs. Since in the introduced
labeling scheme (Fig. 11a) each direction is numbered in such
a way that integers representing opposite directions add to 5,
each pair of these edges is denoted by integers adding to 5.
Since there are length/2 pairs, the sum of all the integers, each
representing an edge in the cycle, is 5 x length/2, or
2.5 x length. □


Fig. 11c shows an example in which v = (1, 1, 1, 2, 4, 2, 4, 3, 4, 3).
From this we get sum = 1+1+1+2+4+2+4+3+4+3 = 25,
length = 10. Applying Theorem 1, a cycle is detected. A vector
corresponding to a cycle-bearing configuration would have a
subvector that demonstrates the above characteristic.
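
The sum test can be tried out with a short sketch. The sketch assumes 1 = right and 4 = left as stated in the text, and 2 = down, 3 = up as inferred from the clockwise renumbering rule; this orientation is an assumption. It also adds two checks that are not in the text: a net-displacement test and a visited-node scan. The vector (2, 2, 2, 4) meets the sum condition without returning to its start, so the sketch treats the sum as a necessary condition and relies on the visited-node scan for the final decision.

```python
# Assumed unit steps for the four direction labels.
STEP = {1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (-1, 0)}


def satisfies_sum_condition(v):
    """Sum test of Theorem 1, written without fractions."""
    return 2 * sum(v) == 5 * len(v)


def is_closed(v):
    """True if the walk ends at the node where it started."""
    dx = sum(STEP[d][0] for d in v)
    dy = sum(STEP[d][1] for d in v)
    return dx == 0 and dy == 0


def contains_cycle(v):
    """True if the walk revisits any node (visited-set scan)."""
    x = y = 0
    seen = {(0, 0)}
    for d in v:
        x, y = x + STEP[d][0], y + STEP[d][1]
        if (x, y) in seen:
            return True
        seen.add((x, y))
    return False


fig_11c = (1, 1, 1, 2, 4, 2, 4, 3, 4, 3)
print(sum(fig_11c), satisfies_sum_condition(fig_11c), is_closed(fig_11c))
print(satisfies_sum_condition((2, 2, 2, 4)), is_closed((2, 2, 2, 4)))
```

The visited-set scan runs in O(k) per vector and also rejects two adjacent edges in opposite directions, since these revisit a node immediately.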


We have made the following observations on equivalent
configurations of P_k on M_s. Consider the example shown in
Fig. 12a. If the path is traced starting from node A, then the
following vector is obtained: H_1 = (1, 2, 2, 4, 4, 3, 1). But if
node B is the starting point, then we would get
H_1' = (4, 2, 1, 1, 3, 3, 4). We know that H_1 and H_1' are
equivalent, but how do we detect the equivalence? Since we
have chosen 1 to be the first integer in all vectors, H_1' needs to
be renumbered. In order to convert 4 to 1, we realize that the
edges heading left must be forced to head right. Since mirror
images are equivalent, we can convert H_1' to its mirror image
along the y-axis, causing the horizontal edges to exchange
directions. As a result, the integers 1 and 4 are interchanged,
giving the vector H_1'' = H_2 = (1, 2, 4, 4, 3, 3, 1). Fig. 12b
shows the corresponding configuration. Similarly, the mirror
image of a configuration along the x-axis causes the vertical
edges to exchange directions, resulting in the interchanging of
2 and 3 in the corresponding vector. Fig. 12c shows such a
mirror image (of the path in Fig. 12a). Fig. 12d shows the path
in 12a reflected twice, once along the x-axis and once along the
y-axis. The corresponding vector is obtained by interchanging
1 and 4 as well as 2 and 3. This is equivalent to subtracting
each integer in the original vector from 5. Clearly,
interchanging 1 and 4 or 2 and 3 in a vector does not result in a
new (non-equivalent) configuration. Finally, let us observe the
configurations in Figs. 12e and 12f. These are both 90°
rotations of Fig. 12a, one clockwise and the other
counterclockwise. When a configuration is rotated 90°
clockwise, integers in the original vector must be renumbered
according to the following:

1 → 2, 2 → 4, 3 → 1, 4 → 3 (2)

When a configuration is rotated 90° counterclockwise, the
vector must be renumbered as follows:

1 → 3, 2 → 1, 3 → 4, 4 → 2 (3)

Thus, renumbering a vector according to (2) or (3) would not
alter the configuration (all resulting configurations are
equivalent). After observing the above, it is readily seen that
whenever we have a vector whose first element (i) is not 1, the
vector can be renumbered (to begin with 1) as follows:


1. If i=2, then reassign integers according to (3). 
2. If i=3, then reassign integers according to (2). 
3. If i=4, then interchange 1 and 4. 


As mentioned earlier, two adjacent edges cannot be in 
opposite directions. Therefore, when appending an edge to the 
front of an existing path (represented by a vector that begins 
with 1), only three choices are available. The edge may be 
represented by one of the following three integers: 1, 2, or 3. 
Whenever a vector is inverted or extended at the front, 
renumbering may be required. A configuration (H_1 in Fig. 12a)
and its mirror image (H_3 in Fig. 12c) along the x-axis both
have vectors that begin with 1. It is therefore necessary to
check for these equivalent configurations.


Fig. 12. Mirror images and rotations of a path on a mesh, traced from starting point A: H_1 = (1, 2, 2, 4, 4, 3, 1); H_2 = (1, 2, 4, 4, 3, 3, 1); H_5 = (2, 4, 4, 3, 3, 1, 2); H_6 = (3, 1, 1, 2, 2, 4, 3).
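
The renumbering rules for the mesh can be checked against the vectors of Fig. 12 with a small sketch. The table and function names are hypothetical; the mappings themselves are rules (2) and (3) and the two interchanges described in the text.

```python
CW = {1: 2, 2: 4, 3: 1, 4: 3}        # rule (2): 90 degrees clockwise
CCW = {1: 3, 2: 1, 3: 4, 4: 2}       # rule (3): 90 degrees counterclockwise
SWAP_14 = {1: 4, 2: 2, 3: 3, 4: 1}   # mirror image along the y-axis
SWAP_23 = {1: 1, 2: 3, 3: 2, 4: 4}   # mirror image along the x-axis


def relabel(v, table):
    return tuple(table[d] for d in v)


def renumber(v):
    """Make the vector begin with 1, following the three listed rules."""
    if v[0] == 2:
        return relabel(v, CCW)
    if v[0] == 3:
        return relabel(v, CW)
    if v[0] == 4:
        return relabel(v, SWAP_14)
    return tuple(v)


def trace_from_other_end(v):
    """Reverse the order and turn every edge to the opposite direction."""
    return tuple(5 - d for d in reversed(v))


H1 = (1, 2, 2, 4, 4, 3, 1)
print(trace_from_other_end(H1))            # (4, 2, 1, 1, 3, 3, 4)
print(renumber(trace_from_other_end(H1)))  # (1, 2, 4, 4, 3, 3, 1)
print(relabel(H1, CW), relabel(H1, CCW))
```

The first two lines reproduce H_1' and H_2, and the last line reproduces H_5 and H_6.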


Now we are ready to present the algorithm for enumerating
up to x (a chosen constant) non-equivalent configurations of
P_k on M_s. As in Algorithm 1, two arrays, H(1:x, 1:k-1) and
T(1:x, 1:k-1), are used to store the final and the intermediary
vectors, respectively. The algorithm is as follows.


Algorithm 2:


Input: s, k, x.
1. If k=2, exit with H(1)=1, c=1.
2. If k>2, set H(1, 1)=1.
3. Until H contains vectors of k-1 elements, set T=H, erase
   H, and do the following:
   A. For every vector v in T, do the following:
      a. For every integer i such that 1 ≤ i ≤ 4 and i+j ≠ 5, do the
         following: \ j is the last integer in v \
         i. Append i to the end of v.
         ii. Perform cycle detection. If positive, go to 3a.
         iii. Invert and renumber the vector; check if it exists in H.
              If positive, go to 3a.
         iv. Interchange 2 and 3 in the vector; check if the resulting
             vector exists in H. If positive, go to 3a.
         v. Invert and renumber the vector; check if it exists in H.
            If positive, go to 3a.
         vi. Save the vector in H. If |H| = x, go to 3; otherwise, go
             to 3a.
      b. For every integer i such that 1 ≤ i ≤ 3, do the following:
         i. Append i to the front of v and renumber the resulting
            vector. Check if it exists in H. If positive, go to 3b.
         ii. Perform cycle detection. If positive, go to 3b.
         iii. Invert and renumber the vector; check if it exists in H.
              If positive, go to 3b.
         iv. Interchange 2 and 3 in the vector; check if the resulting
             vector exists in H. If positive, go to 3b.
         v. Invert and renumber the vector; check if it exists in H.
            If positive, go to 3b.
         vi. Save the vector in H. If |H| = x, go to 3; otherwise, go
             to 3b.
4. If |H| < x, set c = |H|, output("c is equal to", c).
   Otherwise, output("c is at least", x).


In Algorithm 2, Step 3 is repeated k-2 times. Each iteration of
Step 3 causes Step 3A to be executed up to x times, each of
which in turn performs Steps 3a and 3b three times. Steps 3a
and 3b each require $O(k^2)$ computing time. As a result, the
total time requirement for Algorithm 2 is $O(k^3)$. If k ≤ K,
where K is a constant upper bound on k, this reduces to O(1).
Like Algorithm 1, the memory requirement is O(k), or O(1) if k
is bounded.
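
The cycle detection step used in Algorithm 2 can be illustrated with a short Python sketch. The mapping from direction codes to mesh moves is an assumption made for this example (1 = +x, 4 = -x, 2 = +y, 3 = -y, so that opposite directions sum to 5); the function name is hypothetical, and the mesh boundary (parameter s) is ignored here.

```python
# ASSUMED geometry: codes 1/4 move along x, codes 2/3 move along y.
STEP = {1: (1, 0), 4: (-1, 0), 2: (0, 1), 3: (0, -1)}


def has_cycle(vector):
    """Return True if the path described by the direction vector revisits a node."""
    x, y = 0, 0
    visited = {(x, y)}
    for direction in vector:
        dx, dy = STEP[direction]
        x, y = x + dx, y + dy
        if (x, y) in visited:
            return True
        visited.add((x, y))
    return False
```

Each check walks the vector once, so a single detection is linear in the path length under these assumptions.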


5. CONCLUSIONS 


We have defined three parameters (the resiliency triple) and 
discussed their importance in the fault tolerance and diagnosis 
of multiprocessor systems. We have also presented the 
solutions for determining the first two parameters, multiplicity 
and robustness, and two algorithms which determine the 
configurability and enumerate various non-equivalent 
configurations for a path computation graph mapped onto a 
hypercube or a mesh architecture graph. The resiliency triple is 
a good measure of a multiprocessor system's resiliency to 
faults and its flexibility for job reconfiguration. It is also related 
to the system utilization and fault recovery. The configurations 
generated by the algorithms proposed in Section 4.3 are useful 
for efficient task allocation as well as effective fault recovery. 
The efficiencies of these algorithms have been improved by 
carefully choosing a path representation scheme in each case, 
and by selecting an effective approach to generating the 
configurations. For bounded k, Algorithm 1 has O(logN) 
computing time, and requires O(1) storage space, and for 
Algorithm 2, both time and memory requirements are constant. 
Abstract: An encoding technique, the weighted checksum code
(WCC), is proposed to achieve concurrent error detection in matrix 
arithmetic and signal processing on highly concurrent VLSI structures. 
In order not to increase the roundoff errors when we incorporate the 
WCC into the computation, a simple roundoff error analysis is used to 
guide the construction of the WCC. A new data retry technique is then 
proposed to locate the faulty processors and identify the correct outputs. 
Such an approach provides rapid error detection with low hardware 
overhead while system performance is not significantly degraded for the 
sake of fault tolerance. 


1. Introduction


Many algorithms for digital signal and image processing, such as Fast 
Fourier Transform (FFT), Finite Impulse Response filters (FIR), 1-D 
convolution, 2-D convolution [1], and feature extraction and pattern 
classification [2], require large-scale matrix or vector computations in 
their solutions. Fast matrix algorithms for solving large-scale matrix
computations have been proposed by Kant and Kimura [3], Sameh and 
Kuck [4], Hwang and Cheng [5], and many other researchers. Also, 
many existing architectures consisting of array-structured machines, 
such as ILLIAC IV, MPP (Massively Parallel Processor) [6], and 
systolic array processors [1,7] have been proposed to solve these 
problems effectively. A major difficulty with a high degree of 
integration is that a single flaw in a chip can render an entire computing 
system useless. It is, therefore, desirable to have a high-performance 
system which can also tolerate physical failures in the system by
providing correct results, or one which can at least detect the error, 
restructure the system, and retry the computation. 


An encoding technique, the weighted checksum code (WCC), was 
proposed in [8] to achieve both error detection and correction for matrix 
operations using highly concurrent VLSI computing structures. This 
technique is very cost-effective and valid when fixed-point number 
systems are employed. Since roundoff errors may destroy the error 
correction capability of WCC, it may be difficult to apply this technique 
alone on floating-point number systems. In this paper, the Weighted 
Checksum Code (WCC) will be used to perform concurrent error 
detection (CED) fast and cost-effectively. A simple roundoff error 
analysis is used to guide the construction of the WCC such that the 
roundoff errors will not increase due to the incorporation of the WCC 
into the computation. A new data retry technique is then proposed to 
locate the faulty processors and identify the correct outputs. Such an 
approach provides rapid error detection with low hardware and time 
overhead compared with previous attempts at using hardware for error 
correction [8]. Once an error is detected (a relatively rare event in 
practice), additional time steps are used for fault location. Thus, system 
performance is not significantly degraded for the sake of fault tolerance. 
Large roundoff errors are detected and treated in the same manner as 
functional errors. However, the data retry technique can also distinguish 
between the roundoff errors and functional errors which are caused by 
some physical failures. The proposed scheme, error detection by 
hardware redundancy method and error correction by time redundancy 
method, is thus cost-effective and valid for both fixed-point and 
floating-point number systems.


* This research was supported by the Semiconductor Research Corporation under 
Contract SRC-RSCH84-06-049. 


** Professor Abraham is with Coordinated Science Laboratory, University of Illinois at 
Urbana-Champaign. 


For simplicity of treatment, this discussion will be based on linear array 
architectures which are believed to hold the most promise in VLSI 
computing structures for their flexibility, low cost, and applicability to 
most of the interesting algorithms. A similar discussion clearly holds 
for two-dimensional array architectures as well. 


In Section 2, a module-level fault model applicable to VLSI is 
described. In Section 3, the matrix encoding technique is reviewed. 
Section 4 discusses the effect on the word length and the roundoff error 
analysis. In Section 5, a concurrent error detection scheme using the 
weighted checksum is proposed. Section 6 describes the faulty 
processor identification and error correction procedures. In Section 7, a 
procedure for obtaining correct data and identifying faulty processors is 
described for systems with multiple faulty processors. 


2. The Fault Model 


In this paper, we allow a module (such as a processor or computation 
unit in a multiple processor system) to produce any arbitrary logical 
errors under failures. We also assume that, at most, one module is 
faulty within a given period of time, which will be relatively short 
compared to the mean time between failures. In Section 7, systems 
with multiple faulty processors are also discussed. 


Since effective error correcting schemes, such as Hamming codes [9] 
and Alternate-data retry [10], exist for communication lines and 
memories, we will assume that failures in the communication lines and 
memories are detected and corrected by those methods. In this paper, 
we will, therefore, focus on the fault tolerance of the processor array. 


3. The Weighted Checksum Encoding Scheme 
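Let us denote a matrix H, the WCC-matrix, as

$$H = \begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1n} & -1 & 0 & \cdots & 0 \\
w_{21} & w_{22} & \cdots & w_{2n} & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
w_{t1} & w_{t2} & \cdots & w_{tn} & 0 & 0 & \cdots & -1
\end{bmatrix}$$


Using this, a compact description for the Weighted Checksum Code
(WCC) can be given in terms of matrices. Readers can refer to [8] for
details.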


Definition 1: Let H be a t-by-(n+t) matrix of numbers. Then the set
of (n+t)-element vectors that satisfy the matrix equation HA = 0 is
called the code space of H, where A is a column vector.


Theorem 1: The code vector of a WCC-matrix H has at least d nonzero
elements (or is a distance-d code) if and only if every combination of
d-1 or fewer columns of H is linearly independent.


From Theorem 1 and the construction principle of the WCC, we
can construct a WCC with a suitable capability by suitably assigning the
weights of the matrix H. Since a single module-level fault model
applicable to VLSI has been assumed, a WCC-matrix H, where
H = [1 1 ... 1 -1], will be used to demonstrate the encoding
technique and develop the theory. This specific distance-2 code, which
is a special case of the WCC and is simply called the checksum code, will be
used to efficiently achieve concurrent error detection (CED). However,
the results hold for the general distance-(t+1) Weighted Checksum
Encoding Matrices whose WCC-matrices satisfy the requirement of
Theorem 1.
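
As an illustration of Definition 1, the following Python sketch builds a WCC-matrix from a block of weights and tests whether a vector lies in its code space. The function names and the example weights are hypothetical and are not taken from [8]; integer arithmetic is used so that the check HA = 0 is exact.

```python
def wcc_matrix(weights):
    """Build H = [W | -I] from a t-by-n list of weights W."""
    t = len(weights)
    return [list(row) + [-1 if k == r else 0 for k in range(t)]
            for r, row in enumerate(weights)]


def encode(weights, data):
    """Append the t weighted checksums to an n-element data vector."""
    return list(data) + [sum(w * d for w, d in zip(row, data)) for row in weights]


def syndrome(H, vector):
    """Compute H * vector; an all-zero result means the vector is in the code space."""
    return [sum(h * v for h, v in zip(row, vector)) for row in H]
```

With the single weight row `[1, 1, ..., 1]`, this reduces to the plain checksum code H = [1 1 ... 1 -1] used in the rest of the discussion.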
Assume in the following discussion that A is an n-by-m matrix. Since
n can be 1 or m can be 1, vectors are defined in the same way as
matrices. Let us define:


$$e^T = [1\ 1\ \cdots\ 1] \quad \text{and} \quad f^T = [1\ 1\ \cdots\ 1],$$


where e is an n-by-1 column vector and f is an m-by-1 column vector.


Definition 2: The column, row and full checksum matrices $A_c$, $A_r$, and
$A_f$ of the matrix A are defined as:

$$A_c = \begin{bmatrix} A \\ e^T A \end{bmatrix}, \quad
A_r = \begin{bmatrix} A & Af \end{bmatrix}, \quad
A_f = \begin{bmatrix} A & Af \\ e^T A & e^T A f \end{bmatrix}.$$


It can be seen that five matrix operations exist which preserve the 
checksum property: addition, multiplication, LU-decomposition, 
transpose, and product of a matrix with a scalar. They are given in the 
following theorems without proofs. 


Theorem 2: If $B_1 \cdots B_k A = C$, then $B_1 \cdots B_k A_r = C_r$.

Corollary 1: $A_c B = C_c$.

Corollary 2: $A_c B_r = C_f$.

Theorem 3: If A + B = C, then $A_c + B_c = C_c$, $A_r + B_r = C_r$,
$A_f + B_f = C_f$.

Theorem 4: If sA = C, where s is a scalar, then $sA_c = C_c$, $sA_r = C_r$,
$sA_f = C_f$.

Theorem 5: If $A^T = C$, then $A_c^T = C_r$, $A_r^T = C_c$, $A_f^T = C_f$.


Theorem 6: When the matrix A is LU decomposable, the full
checksum matrix of A, $A_f$, can be decomposed into a column checksum
lower matrix and a row checksum upper matrix such that $A_f = L_c U_r$.
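
The checksum-preserving property of multiplication can be checked numerically. The sketch below, with hypothetical helper names and small integer matrices chosen for this example, multiplies a column checksum matrix by a row checksum matrix and compares the product with the full checksum matrix of AB.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def column_checksum(A):
    """Append a row holding the sum of each column (A_c)."""
    return [list(row) for row in A] + [[sum(col) for col in zip(*A)]]


def row_checksum(A):
    """Append a column holding the sum of each row (A_r)."""
    return [list(row) + [sum(row)] for row in A]


def full_checksum(A):
    """Append both the checksum column and the checksum row (A_f)."""
    return column_checksum(row_checksum(A))
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], `matmul(column_checksum(A), row_checksum(B))` equals `full_checksum(matmul(A, B))`, so the product carries its own row and column checksums.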


4. Effect On The Word Length 


When processing the matrix with fixed-point systems, the definitions of
the summation elements may be modified to use residue arithmetic and,
thus, avoid very large checksums [8,11]. Then,

$$a_{i,m+1} = \sum_{j=1}^{m} a_{i,j} \bmod M \quad \text{for } 1 \le i \le n,$$

$$a_{n+1,j} = \sum_{i=1}^{n} a_{i,j} \bmod M \quad \text{for } 1 \le j \le m,$$

where $M = 2^l$. We assume that all the numbers lie in the range
$-1 \le a \le 1$.


In floating-point number systems, the modified definitions are:

$$a_{i,m+1} = 2^{-\lceil \log_2 m \rceil} \sum_{j=1}^{m} a_{i,j} \quad \text{for } 1 \le i \le n,$$

$$a_{n+1,j} = 2^{-\lceil \log_2 n \rceil} \sum_{i=1}^{n} a_{i,j} \quad \text{for } 1 \le j \le m.$$


The reason for the modified definitions of floating-point number systems
will be discussed later in this section.


In floating-point arithmetic, each number x is represented in the form
$x = m 2^e$, where m is called the mantissa and e the exponent. We
assume that $1/2 \le |m| < 1$ for normalized floating-point arithmetic
operations, where |m| is the absolute value of m. We denote l as the
number of binary digits allocated both to a fixed-point number and the
mantissa of a floating-point number, and u as the machine-dependent
unit roundoff. For example, $u = 2^{-l}$ in an l-digit fixed-point system.


Following [12], we will use fi(.) to denote the computed fixed-point
result of the argument and fl(.) to denote the computed floating-point
result of the argument. The equivalence sign will be used to emphasize
that rounding errors have been taken into account. We have

$$fi(x \pm y) \equiv x \pm y, \qquad fl(x \pm y) \equiv x(1+\epsilon) \pm y(1+\epsilon),$$

$$fi(xy) \equiv xy + \delta, \qquad fl(xy) \equiv xy(1+\delta),$$

where $|\epsilon|, |\delta| \le u$. In the computations, we assume that the
computed results do not lie outside of the permitted range.


Let us discuss the case of fixed-point number systems first. Assume that
the equation we want to calculate is $A_{n+1,m} X_{m,1} = B_{n+1,1}$. Then

$$b_i \equiv fi(a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,m}x_m)
\equiv a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,m}x_m + \delta_1 + \delta_2 + \cdots + \delta_m.$$


If we do the checksum verification for the vector B, the error bound E is

$$E = \left| \sum_{i=1}^{n} b_i - b_{n+1} \right| \le (n+1)mu.$$


The error bounds of the fixed-point systems depend only on the size of
the problem.


Let us now discuss the case of floating-point number systems. Here

$$b_i \equiv fl(a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,m}x_m)$$

$$\equiv a_{i,1}x_1(1+\delta_1)(1+\epsilon_1)(1+\epsilon_2)\cdots(1+\epsilon_{m-1})
+ a_{i,2}x_2(1+\delta_2)(1+\epsilon_1)(1+\epsilon_2)\cdots(1+\epsilon_{m-1})
+ \cdots + a_{i,m}x_m(1+\delta_m)(1+\epsilon_{m-1}).$$


The error bound E of vector B is

$$E = \left| \sum_{i=1}^{n} b_i - b_{n+1} \right| \le (n+1)mu \, \|x\| \, \|a_{\max}\|.$$


Thus, in floating-point systems, the error bound is affected by
$\|a_i\|$. In order not to increase the roundoff error bound when we
incorporate checksum techniques into the computations, we can use

$$a_{n+1,j} = 2^{-\lceil \log_2 n \rceil} \sum_{i=1}^{n} a_{i,j}$$

because

$$\|a_{n+1}\| = 2^{-\lceil \log_2 n \rceil} \left\| \sum_{i=1}^{n} a_i \right\| \le \|a_{\max}\|.$$
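
The effect of the power-of-two scaling can be seen in a small Python sketch. The helper names are hypothetical, and the Euclidean norm is used here only as an assumption for the illustration; the point is that the scaled checksum row is no larger in norm than the largest data row.

```python
import math


def scaled_checksum_row(A):
    """Column sums of A scaled by 2**(-ceil(log2 n)), where n is the number of rows."""
    n = len(A)
    # (n - 1).bit_length() equals ceil(log2 n) for n >= 1, without floating-point log.
    scale = 2.0 ** -((n - 1).bit_length())
    return [scale * sum(col) for col in zip(*A)]


def norm2(v):
    """Euclidean norm (an assumed choice of norm for this example)."""
    return math.sqrt(sum(t * t for t in v))
```

Because the scale factor is a power of two, applying it changes only the exponent of each floating-point sum and introduces no extra rounding of the mantissa.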


5. Concurrent Error Detection 


We have seen in Section 3 that checksum matrix operations produce 
code outputs which provide some degree of error-detecting or correcting 
capability. However, a faulty module may cause more than one element 
of the result to be erroneous if it is used repeatedly during a simple 
matrix operation. This problem is solved by using multiple processors 
and scheduling each processor to calculate only a few data elements. 
The errors caused by a faulty processor are then confined either to a few 
data elements or to only one element. In this manner, the checksum 
technique incorporated into matrix operations can detect or correct 
errors caused by a faulty module. 


In this paper, the weighted checksum technique is used only to achieve
concurrent error detection. A new data retry technique,
which will be described in Sections 6 and 7, will then be used to
achieve error correction and faulty processor identification.


In the following discussion, the matrix-vector multiplication will be 
used as an example to demonstrate the concurrent error detection 
scheme. It is obvious that the concurrent error detection scheme can
apply to a variety of matrix operations and signal processing algorithms
as well.


When matrix operations are performed on a computer system or with
special-purpose hardware, roundoff errors due to finite word length are
hard to avoid whenever fixed-point or floating-point arithmetic is used.
A small difference, η, which can be decided by simulated results or
analytic error bounds, as discussed in the previous section, must be
allowed for when checking for equality. If the analytic error bounds are
used as η, then any functional errors which affect the outputs of the
computations more than the analytic error bounds will be detected. In
practice, η should be chosen between zero and the analytic error
bounds. Thus, large roundoff errors can also be detected and treated in
the same manner as a computing unit with a functional fault. The
functional errors which affect the outputs of the computations much less
than η will, of course, not be detected, but these will not affect the
results significantly. The best choice of η may depend on the
application and will not be discussed in this paper. Thus, in the
following discussion, we will only outline the concurrent error detection
scheme.


Many signal and image processing algorithms such as FIR and DFT [1]
involving a "multiply-and-accumulate" type of expression can be
formulated as matrix-vector multiplication problems. Figure 1 shows a
linear processor array; the input data streams show the multiplication of
a 5-by-4 column checksum matrix with a 4-by-1 vector. The operation
of the array is described as follows: we want to calculate the equation
$A_{n+1,n} X_{n,1} = B_{n+1,1}$. The matrix element $a_{i,j}$ is stored in the local
memory of the ith processor at the jth time step, and $x_j$ is broadcast to
each processor at the jth time step, as shown in Figure 1. Each
processor multiplies n pairs of $a_{i,j}$ and $x_j$ and accumulates the products
in a register. Each processor thus calculates one element of the result
vector, and a faulty processor affects only one element. Any error can,
therefore, be detected by using the checksum scheme.
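
A software analogue of this check might look as follows. This is a sketch under simplifying assumptions: the processors are modeled as rows of a sequential matrix-vector product, the tolerance argument `eta` plays the role of the allowed difference discussed above, and all function names are hypothetical.

```python
def column_checksum(A):
    """Append a checksum row holding the sum of each column."""
    return [list(row) for row in A] + [[sum(col) for col in zip(*A)]]


def matvec(M, x):
    """Each output element models the accumulator of one processor."""
    return [sum(a * v for a, v in zip(row, x)) for row in M]


def checksum_ok(b, eta=0.0):
    """Compare the sum of the information elements with the checksum element."""
    return abs(sum(b[:-1]) - b[-1]) <= eta
```

A result vector produced by fault-free processors passes `checksum_ok`, while changing any single element by more than `eta` makes the check fail.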


Figure 1. Checksum matrix-vector multiplication


After the result vector is obtained, one n-operand adder can be used to
calculate the checksum. Define the overhead ratio as the ratio of the
time, hardware, or delay overhead required by the CED technique to the
time, hardware, or delay complexity of the original system without
CED. Since the whole computation, including the checking, is
pipelined, there is no time overhead in terms of performance. However,
the delay overhead ratio is O([illegible in source]/n), and the hardware
overhead ratio is O((n/l + 1)/n), where l is the word length. If n = l, then the delay
overhead ratio is O(1/n) and the hardware overhead ratio is O(2/n). In
floating-point systems, since the execution time of addition is
comparable to that of multiplication, the delay overhead ratio becomes
$O(\log_2 n / n)$.


6. Error Correction By Data Retry


In this section, we will propose a technique for the correction of 
erroneous results using time redundancy. This technique also enables us 
to distinguish between functional errors and large roundoff errors.


Assume we want to calculate the equation $A_{n,n} X_{n,1} = B_{n,1}$. Our
approach is that on the second try, we assign the computation of
each element of the output vector B to a processor which is different
from the one used before. For example, in the first run, the
computation of $b_1$ is assigned to processor 1, $b_2$ to processor 2, ...,
and $b_n$ to processor n. In the second try, the computation of $b_1$ is
assigned to processor 2, $b_2$ to processor 3, ..., and $b_n$ to processor 1.
(In order to simplify the notation, we will use i+1 to represent
(i mod n) + 1 in the following discussion.) After two tries, each
element of the vector has two results (not necessarily different). We
then compare the two results. If the two results of one element from
two tries are different, we say that this element has inconsistent results.


(1) If there are two elements which have inconsistent results, for
example $b_i$ and $b_{i+1}$, we know that processor i+1 is faulty. The
correct result of $b_i$ can be obtained from the first try and the correct
result of $b_{i+1}$ can be obtained from the second try.


(2) If there is only one element which has an inconsistent result, for
example $b_i$, then either processor i produced a transient error in the
first try or processor i+1 produced a transient error in the second
try. The correct result of $b_i$ can be obtained (i) by subtracting the
difference between the computed sum of the elements of the vector and the
checksum from the erroneous element in the information part, or (ii) by
replacing the checksum by the computed sum of the information
elements in the summation vector, in the case where the checksum is
incorrect. This correction procedure can be based on the data either
from the first try or the second try.


(3) If there is no element which has an inconsistent result, then large 
roundoff errors have been detected in the first try. 
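
The three cases above can be sketched in Python as a small simulation. This is an illustration only: processors are indexed from 0, a permanently faulty processor is modeled by adding 1 to whatever it computes, at least three processors are assumed, and the function names are hypothetical.

```python
def run_try(A, x, shift, faulty=None):
    """Compute b with element i assigned to processor (i + shift) mod n."""
    n = len(A)
    out = []
    for i in range(n):
        value = sum(a * v for a, v in zip(A[i], x))
        if (i + shift) % n == faulty:
            value += 1  # injected error from the (simulated) faulty processor
        out.append(value)
    return out


def diagnose(first, second):
    """Classify the outcome of two tries following cases (1) to (3)."""
    n = len(first)
    bad = [i for i in range(n) if first[i] != second[i]]
    if not bad:
        return "roundoff", None, list(first)      # case (3)
    if len(bad) == 1:
        return "transient", bad[0], None          # case (2): fix via the checksum
    if len(bad) == 2:
        i, j = bad
        if (i + 1) % n != j:
            i, j = j, i
        if (i + 1) % n == j:                      # case (1): processor j is faulty
            corrected = list(first)
            corrected[j] = second[j]
            return "permanent", j, corrected
    return "unresolved", None, None
```

In case (1) the corrected vector takes every element from the first try except the one computed by the faulty processor, which is taken from the second try.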


7. Error Correction For Multiple Faults 


From Theorem 1 and the construction principle of the WCC, we can
construct a WCC with a distance (t+1) such that the code can detect up
to t errors. A time redundancy method can then be used to perform the
error correction and identification of faulty units. The general theory of
the WCC has been reviewed in Section 3. In this section we will
concentrate on the error correction by using time redundancy.


There is assumed to be a set of jobs $J = \{ j_1, j_2, \ldots \}$ to be
performed, and a set of identical units $U = \{ u_1, u_2, \ldots \}$ available to
perform them. For example, we want to calculate the equation
$A_{n,n} X_{n,1} = B_{n,1}$ in a linear array with n processors. The computation
of each element of the result vector will be thought of as a job. Once we
detect any errors with the WCC, each job will be reassigned to a unit
which is different from the units which have been assigned to do this
job during the previous computation. When the jobs have been
completed by the units, the results are compared to the previous results.
The outcomes of such comparisons are the basis for identifying faulty
units and obtaining correct data.


A system under a t-fault assumption refers to one in which up to t
faulty processors are permitted. It will be assumed that when two faulty
units perform the same job, they do not produce identical, incorrect
results [13]. This is also shown in Figure 2. The outcome "pass"
indicates that both units at this computation are fault-free, or the output
data calculated by these two units are reliable. The outcome "fail"
indicates that at least one of the units is faulty.


Unit 1        Unit 2        Comparison outcome
fault-free    fault-free    0 (pass)
fault-free    faulty        1 (fail)
faulty        fault-free    1 (fail)
faulty        faulty        1 (fail)


Figure 2. The outcomes of a comparison of a pair of units


It is also assumed that there is a host computer which collects the 
information on comparisons and, thus, derives the state of the whole 
system. Here, we require that the diagnosis never be incorrect in the 
sense that a fault-free unit is diagnosed as faulty. 


Theorem 7: For a t-fault system, at most t+2 tries are required to
identify the correct data and the faulty processors.


In real situations, we may usually use only a small number of tries to
identify the correct data and faulty processors. For example, consider a
system with t = 4. Figure 3 shows an example with four faulty
processors and a hypothetical set of outcomes. The jobs which are
incorrectly performed are marked with *. The maximum number of
tries to locate the correct data and the faulty processors is 6. From


Figure 3, we obtain the correct results of j4 and j¢ after two tries. But 
we do not know the correct results of j;, j2, j3, js and j7. No units 
can be identified as faulty. After the third try, we obtain the correct 
results of j;, j2 and j7, and we know that u, and wu; are faulty. But 
we still do not know the correct results of j; and j;. After the fourth 
try, we obtain the correct result of j; and identify that us is faulty. 
After the fifth try, we obtain the correct result of j; and identify that u, 
is faulty. Thus, instead of six tries, we have only used five tries to 
identify the correct results and the faulty processors. If we can replace 
or repair the faulty processors right after they are identified, the number 
of tries might be reduced further. In this particular example, if we 
replace u, and u; after the third try, we can get the correct results of 
both j; and js; and identify the faulty processors us; and u¢ right after 
the fourth try. Thus, instead of five tries, four tries are enough to locate 
the correct results and identify the faulty processors. 


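[Figure 3 table: rows are tries 1 to 5, columns are units $u_1$ to $u_7$, and each entry is the job assigned to that unit in that try, with incorrectly performed jobs marked by *. The individual entries are not recoverable from the source.]


Figure 3. An example of job assignments and the results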


The algorithm for determining correct data and identifying faulty 
processors is described below: 


(1) Assign each job to a unit which is different from the ones that have
been assigned to execute this job in the previous tries.


(2) Check the results of each job with its previous results. If there is
an outcome "pass", then the correct output of that job is identified. All
the processors which produced erroneous outputs of that job are
identified as faulty.
(Optional -- replace or repair the faulty processors)


(3) If there is any job for which we still do not know the correct
output, go to step (1). Else, exit.


If we obtain correct outputs of all jobs in the second try, we know that 
large roundoff errors have been detected in the first try. 
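
The retry procedure can be simulated with a short Python sketch. Several things here are assumptions made for the illustration rather than details from the paper: jobs are reassigned by a cyclic shift, the number of jobs does not exceed the number of units, a faulty unit always returns a wrong value that is unique to that unit (so two faulty units never agree), jobs already resolved are not re-run, and the optional replace-or-repair step is omitted. The function name is hypothetical.

```python
def locate_faults(n_units, true_results, faulty_units):
    """Repeat cyclic reassignment until every job has two agreeing results."""
    n_jobs = len(true_results)

    def execute(unit, job):
        value = true_results[job]
        if unit in faulty_units:
            value += 1000 * (unit + 1)  # wrong value, distinct for each faulty unit
        return value

    history = [[] for _ in range(n_jobs)]  # per job: list of (unit, value)
    correct = {}
    identified = set()
    tries = 0
    while len(correct) < n_jobs:
        if tries >= n_units:
            raise RuntimeError("no unused unit left for some job")
        for job in range(n_jobs):
            if job in correct:
                continue
            unit = (job + tries) % n_units      # step (1): a unit not used before
            value = execute(unit, job)
            if any(value == old for _, old in history[job]):
                correct[job] = value            # step (2): outcome "pass"
                identified.update(u for u, old in history[job] if old != value)
            history[job].append((unit, value))
        tries += 1                              # step (3): repeat if needed
    return tries, [correct[j] for j in range(n_jobs)], identified
```

With seven units of which four are faulty, this simulation finishes within the t+2 = 6 tries allowed by Theorem 7.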


8. Conclusion 


In this paper we proposed a concurrent error detection scheme using the 
WCC with low hardware overhead for matrix algebra and signal 
processing with highly concurrent VLSI structures. A simple roundoff 
error analysis is used to guide construction of the WCC. A new data 
retry technique is used to locate the faulty processors, obtain the correct 
results, and distinguish between roundoff errors and functional errors. 
Such an approach provides rapid error detection with low hardware
overhead and solves the roundoff error problem in floating-point number
systems. System performance is also not significantly degraded for the
sake of fault tolerance.
EFFICIENT DESIGNS OF PRIORITY QUEUE
ABSTRACT


VLSI designs are examined for the priority queue problem.
We develop designs with superior performance to
earlier designs.


Keywords and Phrases 
VLSI architectures, systolic systems, priority queue 


1. Introduction 


VLSI architectures for a variety of problems have
been proposed by several authors. A bibliography of over
150 research papers dealing with this subject appears in
[8]. In this paper, we are concerned solely with the priority
queue problem. Many applications require the ability
to insert records into a set and to retrieve from the set the
record having the smallest key according to some ordering.
A data structure that provides such services is called
a priority queue.

In evaluating our designs, we assume that the VLSI 
system will be attached to the host processor using a bus. 


The evaluation of a VLSI design should take the following 
into account: 


1. Processors --- how many processors are used in the
VLSI system? This figure is denoted by P.

2. Bus bandwidth --- the maximum amount of data to be
transmitted between the host and the VLSI system in
any cycle. This figure is denoted by B.

3. Speed --- how much time does the VLSI system need to
complete its task and be ready to accept the next
operation? This time may be decomposed into two
non-overlapping times $T_C$ (time for computations) and
$T_D$ (time for data transmissions both within the VLSI
system and between the host and the VLSI system).


One may expect that by using a very high bandwidth
B and a large number of processors P, we can make $T_C$
and $T_D$ quite small. So, $T_C$ and $T_D$ are not in themselves
a very good measure of the effectiveness with which the
resources B and P have been used. Let D denote the
total amount of data that needs to be transmitted
between the host and VLSI system. The ratio


$R_D = B * T_D / D$


measures the effectiveness with which the bandwidth B
has been used. Clearly, $R_D \ge 1$ for every VLSI design.


Let C denote the time spent for computation by a
single processor algorithm. The ratio


$R_C = P * T_C / C$

measures the effectiveness of processor utilization. Once
again, we see that $R_C \ge 1$ for every VLSI design.


In evaluating a VLSI design, we shall be concerned
with $T_C$ and $T_D$ and also with $R_C$ and $R_D$. We would
like $R_C$ and $R_D$ to be close to 1. Finally, we may combine
the two efficiency ratios $R_C$ and $R_D$ into the single ratio
$R = R_C * R_D$. A design that makes effective use of the
available bandwidth and processors will have R close to 1.


The efficiency measure R as defined here is the same 
as that used in [1]-[4] to evaluate VLSI designs for matrix 
multiplication, finite impulse response filter, recursive 
filter and back substitution. This measure is also quite 
similar to that proposed in [6]. In fact, the two measures 
become identical when $T_C = T_D$.


For each of the designs considered in this paper, we
compute $R_C$, $R_D$ and R. In all cases, our designs have
better efficiency ratios than all earlier designs using the
same model. In comparing different architectures for the
same problem, one must be wary about overemphasizing
the importance of $R_C$, $R_D$ and R. Clearly, using P = 1
and B = 1, we can get $R_C = R_D = R = 1$ and no speedup
at all. So, we are really interested in minimizing $T_C$
and $T_D$ while keeping R close to 1.


For the priority queue problem, the single processor 
algorithm uses the heap data structure. When n records 
are already in the heap, both insertion and delete-min 
operations take $3 \log n$ comparisons including the test for
the end of heap. So, for a single operation, $C = 3 \log n$
and D = 1. For comparison among different designs, the
parameter n is used where n is the maximum number of 
values that the designs can handle. 
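
For reference, the single-processor baseline can be sketched with Python's standard `heapq` module, which maintains a binary min-heap in a list. The wrapper names `insert` and `delete_min` are chosen here for illustration; the comparison count of $3 \log n$ quoted above belongs to the paper's cost model and is not measured by this code.

```python
import heapq


def insert(queue, key):
    """Insert a key; the heap invariant is restored in O(log n) steps."""
    heapq.heappush(queue, key)


def delete_min(queue):
    """Remove and return the smallest key in O(log n) steps."""
    return heapq.heappop(queue)
```

A typical use is to start from an empty list, call `insert` for each record key, and call `delete_min` whenever the smallest key is needed.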


VLSI designs for this problem have been proposed
earlier in [5], [7] and [9]. All designs use a linear bidirectional
chain of PEs. The design of [5] permits an
insert/delete-min operation in every four cycles. In each
cycle, at least two comparisons have to be made, one to
determine whether neighboring elements are out of order
and the other to check the status of its three neighboring
PEs, two left and one right. The performance figures of
[5] are P = 3n, B = 2, $T_C$ = 8, $T_D$ = 8, $R_C = 8n/\log n$,
$R_D$ = 16 and $R = 128n/\log n$. The design of [9] is ready
to receive an operation in every 2 cycles, with each cycle
requiring 1 data move and 3 key comparisons to order 3
numbers. Since PEs work in alternate cycles, the number
of PEs can be reduced by half. So P = n/2, B = 2,
$T_C$ = 6, $T_D$ = 2, $R_C = n/\log n$, $R_D$ = 4 and
$R = 4n/\log n$. The design of [7] is ready to receive an
operation in every cycle, but each cycle requires 6 key
comparisons to order 5 numbers in a special order. Their
design has P = n/2, B = 3, $T_C$ = 6, $T_D$ = 1,
$R_C = n/\log n$, $R_D$ = 3 and $R = 3n/\log n$.
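
The ratios quoted for the earlier designs can be reproduced from the definitions of $R_C$ and $R_D$ with exact arithmetic. In the sketch below (the function name and argument names are chosen for this example), P is given as a multiple of n and C as a multiple of log n, so that the returned $R_C$ and R are the coefficients of n/log n.

```python
from fractions import Fraction


def efficiency(p_per_n, B, T_C, T_D, c_per_logn=3, D=1):
    """Return (R_C, R_D, R), with R_C and R expressed as coefficients of n / log n."""
    R_C = Fraction(p_per_n) * T_C / c_per_logn   # R_C = P * T_C / C
    R_D = Fraction(B * T_D, D)                   # R_D = B * T_D / D
    return R_C, R_D, R_C * R_D
```

For example, `efficiency(3, 2, 8, 8)` corresponds to the design of [5] and `efficiency(Fraction(1, 2), 3, 6, 1)` to the design of [7].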


In between two priority queue operations, the application
program usually performs some processing which is likely
to take much longer than $T_C + T_D$. Since these
hardware designs operate continuously, a no-op operation
is required when neither an insertion nor a delete-min
is necessary. None of the above designs handles no-op
explicitly. However, both the designs of [7] and [9] can
perform no-op by the input of (~@,—), while the design of 
[5] accomplish this by the input of ». In all our designs, 
an input operation (insert, delete-min, no-op) will be per- 
formed when the designs are ready to accept a new opera- 
tion. PEs are numbered from left to right with PE 1 
being the leftmost PE. 


2. New Designs 


In our first design, a linear array of n PEs is
required. Each PE has three registers, a, s and l. Register
l contains the value input from its left neighboring PE
in the current cycle. For the leftmost PE, this is the input
to the design. Register s is the status variable. When
s = 0, the last operation performed is an insert and the
value of register a in the PE is to be used directly.
However, when s = 1, the last operation performed is a
delete-min. The value in register a has already been moved
to its left neighbor in the previous cycle and it will be
replaced by the value coming in from its right neighbor.
Initially for all PEs, s = 0 and the contents of registers a
and l are α (= ∞ − 1). Inserting a new value is simply
done by the input of the value to the leftmost PE. A
delete-min operation is done by the input of the special
largest value ∞. A no-op can be achieved by the input of α.
An example sequence of operations is shown in Figure 1.
For each PE, the value of register s is shown above the
value of register a. The even and odd PEs execute
alternately as in the design of [9]. The exact workings of PE
i, when active, are formally described in Algorithm 1.
From Algorithm 1 and Figure 1, the performance figures
of the first design are P = n, B = 1, T_C = 2, T_D = 2,
R_C = 2n/(3 log n), R_D = 2 and R = 4n/(3 log n). Since
odd and even PEs execute alternately, the number of PEs
can be reduced to n/2 with R_C = n/(3 log n) and
R = 2n/(3 log n).
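
The figures quoted for the designs appear to be consistent with
the relations R_C = P*T_C/C, R_D = B*T_D/D and R = R_C*R_D, taking
C = 3 log n and D = 1 as listed with Table 1. These relations are
inferred here from the quoted numbers and are an assumption, not a
restatement of the definitions. The following Python sketch (the
function name `ratios` is hypothetical) only checks that consistency:

```python
from fractions import Fraction
from math import log2

def ratios(P, B, T_C, T_D, n):
    """Return (R_C, R_D, R) under the ASSUMED relations
    R_C = P*T_C/C, R_D = B*T_D/D, R = R_C*R_D, with C = 3*log2(n), D = 1."""
    C = 3 * int(log2(n))        # assumes n is a power of two
    D = 1
    R_C = Fraction(P * T_C, C)
    R_D = Fraction(B * T_D, D)
    return R_C, R_D, R_C * R_D

n = 1024                         # log2(n) = 10
first = ratios(P=n, B=1, T_C=2, T_D=2, n=n)        # first design, n PEs
halved = ratios(P=n // 2, B=1, T_C=2, T_D=2, n=n)  # first design, n/2 PEs
```

With n = 1024 the two calls reproduce R = 4n/(3 log n) and
R = 2n/(3 log n) respectively.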


loop
   do in parallel
      l_i ← l_{i-1}
      a_i ← a_{i+1}   if s_i = 1
   end
   do in parallel
      a_i ← min(l_i, a_i); l_i ← max(l_i, a_i)
      if l_i = ∞ then s_i ← 1
      else s_i ← 0
   end
forever


Algorithm 1


The second design makes use of a linear chain of
O(n/3) PEs. In each PE, four registers a, b, c and d are
required. Register a is the value kept in the PE; registers
b and c are the values sent from its left neighbor and
satisfy the relation b ≤ c. Register d is the value just
moved in from its right neighbor. In each cycle, after
values have been moved in from its left and right
neighbors, each PE will rearrange the contents of registers a, b,
c and d in such a way that d ≤ a ≤ b ≤ c. Since b ≤ c
originally, the above rearrangement process only requires
4 comparisons. To insert a value, simply input the tuple
(−∞, value) to the leftmost PE. Delete-min is performed
by the input of the tuple (∞, ∞) and a no-op by the input of
the tuple (−∞, ∞). An example sequence of operations is
shown in Figure 2. The exact workings of PE i are
formally described in Algorithm 2. The functions max2 and
min2 find the second largest and second smallest
values in the given list, respectively. From Algorithm 2
and Figure 2, we see that the performance figures for the
second design are P = [(n + 2)/3], B = 3, T_C = 4,
T_D = 1, R_C = 4n/(9 log n), R_D = 3 and R = 4n/(3 log n).


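loop   {1 ≤ i ≤ p}
   do in parallel
      b_i ← b_{i-1}   {b_0 and c_0 are input}
      c_i ← c_{i-1}   {where b_{i-1} ≤ c_{i-1}}
      d_i ← d_{i+1}   0 ≤ i < p   {d_0 is output}
   end
   do in parallel
      d_i ← min(a_i, b_i, c_i, d_i)
      a_i ← min2(a_i, b_i, c_i, d_i)
      b_i ← max2(a_i, b_i, c_i, d_i)
      c_i ← max(a_i, b_i, c_i, d_i)
   end
forever


Algorithm 2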
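
The claim that the rearrangement into d ≤ a ≤ b ≤ c needs only 4
comparisons when b ≤ c already holds can be illustrated in software.
The sketch below is one possible comparison schedule (order a and d,
then merge the two sorted pairs); it is an assumption for illustration,
not necessarily the comparator network used in the hardware, and the
name `rearrange` is hypothetical.

```python
def rearrange(a, b, c, d):
    """Return ((d, a, b, c), comparisons) with d <= a <= b <= c.
    Assumes b <= c on entry, as guaranteed by the left neighbor."""
    assert b <= c                     # precondition check, not counted
    count = 1
    lo, hi = (a, d) if a <= d else (d, a)   # comparison 1: order a and d
    # Merge the sorted pairs (lo, hi) and (b, c): at most 3 comparisons.
    left, right, out = [lo, hi], [b, c], []
    while left and right:
        count += 1
        out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    out += left + right
    new_d, new_a, new_b, new_c = out
    return (new_d, new_a, new_b, new_c), count
```

For example, `rearrange(5, 2, 9, 1)` yields the registers
(d, a, b, c) = (1, 2, 5, 9) using 4 comparisons.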




Figure 1 


All the previous designs have an R value of
O(n/log n). The third design improves the ratio R to
O(1) by using a chain of only log n PEs. A fictitious PE,
PE 0, is assumed to handle the input and output of the
design. As in our first design, the even and odd PEs
execute alternately. This design simulates the action
of a min-heap, which is a complete binary tree with the
property that the value of a node is not greater than the
values of its two sons. A min-heap with n elements has
[log n] levels. Each PE in the chain is therefore
responsible for maintaining one level of the min-heap. Quinn [10]
has shown that a tightly coupled shared memory
multiprocessor with [log n] processors can remove an element
from an n-element heap in constant time. However, besides
using shared memory, that design also requires all insertions
to be done before all deletions, and so fails to handle the case
of random insertions and deletions required of a priority
queue.


A single processor delete-min operation deletes
the minimum element from the root of the min-heap,
reinserts the last element into the root and then reheapifies.
However, in any design using a chain of processors, this
last element may still be on its way to the last
processor and hence is not known immediately. Therefore, a
modification to the single processor algorithm is required.
Now suppose that the element which moves up from
the next level to replace the deleted element is chosen to
be the minimum of its two sons. Repeating this process
all the way to the lowest level of the heap may create
empty locations in the data structure. One major
drawback of this is that the number of processors required to
handle n elements will be greater than [log n]. Our
design is based on an observation on how to maintain a
min-heap in a linearly connected chain of processors.
When an element is deleted in the previous level, the
element that moves up to replace it is chosen to be the
minimum of the following three numbers: its two sons and
the last (rightmost) element at the same level as the two
sons in the min-heap. The above process repeats with the
two sons of this minimum being used at the next level.


Let the elements of the min-heap be stored in the a
arrays of each PE. PE i will require a memory of size 2^i.
Since the amount of memory required in the worst case is
approximately n/2 for the last PE, random access memory
is used instead of registers to store these a values,
because it is cheap and readily available. The access time
of random access memory will only be a constant multiple
of the access time when registers are used. Besides the
storage for the a values, registers are required for the
following variables: s, c, l, r, t, e, pt, pn, pp, vl and vr.
Register s has a similar meaning here as in our first design.
When s = 0, the last operation performed is an insert.
However, when s = 1, the last operation performed is a
delete-min. Register l contains the maximum number of
memory locations in this PE, so for PE i, l = 2^i. Register c
indicates how many locations of the a array are currently
being used, so 0 ≤ c ≤ l. Registers r, t and e are
responsible for keeping track of which path the next
insertion is to take down the PE chain. The element a[pt] is
compared with the element input from the previous PE on
its left. As for registers pn and pp, if the element a[pp] of
PE i−1 is being deleted and sent back to PE i−2, the index
of its right son, pn = 2*pp, will be sent to PE i. The element
that PE i sends back to PE i−1 will then be placed in a[pp]
to replace the deleted element. Registers vl and vr contain
respectively the values sent from the left and right
neighboring PEs. As for the initial configuration, the initial
values for the a arrays are α. The initial values for the other
registers in PE i are as follows: s = c = 0, l = e = 2^i,
pt = r = t = 1, and register pn of PE 0 will always be set
to 2. Finally, inserting a value is simply done by the input
of the value to the leftmost PE. A delete-min operation is
done by the input of ∞ and a no-op by the input of α.


An example sequence of operations for the third
design is shown in Figure 3. Here, only the contents of
the a arrays, vl and vr are shown. The exact workings of
PE i during its active cycles are formally described in
Algorithm 3, where reg_i is the reg register of PE i and the
subscript is dropped if it is understood to be that of PE i.
From Algorithm 3 and Figure 3, the performance figures
of this design are P = O(log n), B = O(1), T_C = O(1),
T_D = O(1), R_C = O(1), R_D = O(1) and R = O(1).


do in parallel 
pn, - Ae 1 {png = 2} 
ul; < vl, _ {ul is input} 
a(pp] - ae ifs; = 1 {¢ = Ois output} 
end 
if vl; <a then {insert} 
do in parallel 
s,-O;t<-(¢ +r —2)modrt+1 
ifc </ then 
do in parallel 
alpt] -vl;;c ~<c +1 
end 
else | 
do in parallel 
a|pi| — min(a[pzt],vl;) 
vl; — maz(a[pi, vf, 


en 
endif 
end 
if t = r then 
do in parallel 
pt-pt mod l +1 
e«(e +!1—2)mod/+1 
end 
if e = | then 
do in parallel 
t<2t;r-<2r 
end 
endif 
endif 
elseif vi; = a then {no-op} 
Ss; 0 
else {delete} 


do in parallel 
Let a[pp|] = min(a [pn — pb a[pn],a[e]) 
vr; ~ alpp]; pn ~ pp 
5; ‘i a 


end 
if ur; # a then 
if ¢ = 1 then 
do in parallel 
pt -(pt +1 —2) mod! +1 
e-~e mod! +1 
end 
ife = 1thenr -r/2 
endif 
ifr = 1 then {rightmost active PE} 
do in parallel 
e«c — 1; a[pp] - alc] 
s; - 0; ul; <a 
end 
ale+1)-a {to avoid conflict when pp=c} 
endif 
else 


do in parallel 
8, «0; ul; <a;t<-1 


Algorithm 3 


3. Summary 


The performance figures of the various VLSI archi- 
tectures for the priority queue problem are summarized in 
Table 1. As can be seen, all our designs represent an 
improvement over earlier designs. Our third design is the 
first VLSI system that has attained an R value of O(1). 


Finally, we note that the comparisons among the
different designs are not entirely fair, as our third design
requires a different and considerable amount of memory for
each of the log n PEs. However, the total amount of
memory used is the same in all designs, namely O(n).


Architecture 
Bidirectional Chain 


C = 3logn, D = 1 
Table 1


ALGORITHMS FOR HIGH SPEED MULTI-DIMENSIONAL
ARITHMETIC AND DSP SYSTOLIC ARRAYS

Abstract - With the advent of 3-D VLSI and the
essential role of CAD tools in design, the demand for
high speed computation in several arithmetic and digital
signal processing (DSP) applications can be met by
a systematic technique for transforming algorithms
into specific forms for mapping onto multi-dimensional
systolic arrays. This paper presents such a
technique (called STAMS). The multi-dimensional
systolic arrays derived by the technique
give significant improvements in computation time
compared to their 1-D counterparts, while maintaining
the same number of processing cells. Two examples are
illustrated in the paper: the matrix-vector multiplication
algorithm and the k-point Discrete Fourier
Transform (DFT) algorithm. The technique can also be
applied to other problems such as the FIR filter
algorithm and the 1-D convolution algorithm. An example
of an entire systematic transformation and mapping
procedure that can be incorporated into an integrated
CAD package suitable for user-friendly interactive
design is also given.


1. INTRODUCTION 


Systolic arrays have been developed for the
implementation of many arithmetic and digital signal
processing (DSP) algorithms in the past decade. With the
rising demand for high-speed computation in these
applications and the emergence of three-dimensional
(3-D) VLSI chips [1,2], the need to speed up algorithm
computation by going beyond 1-D (and sometimes even
2-D) systolic networks has increased [3,4]. Throughput
improvements have been demonstrated by
higher-dimensional systolic array implementations. For
example, significant throughput improvements have been
shown by a 3-D systolic array implementation of
matrix-matrix multiplication [5] and simultaneous triple
matrix multiplication [6]. However, many of the existing
methods for mapping algorithms onto multi-dimensional
networks are ad hoc; they take a long
design time and cannot be developed as part of a
CAD tool. The benefits of 3-D VLSI technology and the
improvements in timing by higher-dimensional
structures can be fully exploited only if we can devise a
systematic algorithm transformation and mapping
technique, which is the scope of this paper.


With the advent of silicon-on-insulator (SOI)
technologies, 3-D circuitry is being realized using
techniques such as laser recrystallization of polysilicon,
which allows fabrication of active devices stacked in
two or more layers. Some laboratories have already
succeeded in producing 3-D circuit cells [7,8,9]. With 3-D
VLSI, wire routing becomes easier, more systematic and
shorter, due to the use of the third dimension. The
interconnect wire length increases at a much slower
rate than in planar circuits. Moreover, the gain in circuit
density, which results in a saving of materials, has
permitted much larger networks to be implemented. The
increase in packing density, together with the
improvements in wire routing, leads to a decrease of parasitic
capacitances in the circuit, and hence to an increase in
speed. Besides these benefits, design time can also be
minimized, as wire routing is easier and more
systematic.


The benefits and the reality of 3-D VLSI
necessitate efficient and systematic (automatic)
implementation techniques. These technological advances
and opportunities, together with the essential role of
CAD tools in array design and the need for high speed
computations, represent a challenge in the systematic
mapping of arithmetic and DSP algorithms onto
multi-dimensional arrays [10].


2. REVIEW OF PRIOR ART 


The design of systolic arrays requires a fundamen- 
tal understanding of application, algorithm, and archi- 
tecture. A survey of literature with respect to sys- 
tematic methods of mapping and transforming algo- 
rithms onto systolic arrays reveals many stimulating 
and efficient ideas. For example, S.Y.Kung [11,12] pro- 
poses a mapping technique based on data dependence 
graph, signal flow graph and its systolization. Moldo- 
van [13,14] develops a mapping procedure for cyclic 
loop algorithms based on mathematical linear transfor- 
mation of index sets and data dependence vectors. 


Capello [15,16] presents geometric transformation and 
linear space-time transformation techniques in array 
design and representation. This provides an insightful 
look into how several systolic designs of the same algorithm
relate to each other. Leiserson [17] provides a
systolization scheme for minimizing the number of 
delay elements. Quinton [18] produces a systematic 
method for mapping algorithms that can be expressed 
by a set of uniform recurrence equations. The method 
uses a timing function and an allocation function to 
map these equations onto a finite architecture. 


Many of these authors have proposed procedures 
for systematically mapping an iterative algorithm 
defined over a multi-dimensional index-space onto a 
lower-dimensional array of processors, using linear 
transformations. They restrict their attention to
one-dimensional projection so that if the index-space is
N-dimensional, then the systolic array is
(N-1)-dimensional, with one dimension for time.


In this paper, an attempt is made to increase the
dimension of the index-space for a certain class of
iterative algorithms in a systematic way in order to achieve
higher parallelism without increasing area (silicon chip 
area) complexity. The method first transforms the algo- 
rithm by increasing the index-space from N-dimension 
to M-dimension (M>N). Mapping of the algorithm 
with M-dimensional index-space can be obtained by 
combining the systolic array for the same algorithm 
with N-dimensional index-space, or can be realized by 
many of the linear transforming techniques mentioned. 
The technique is particularly suitable for many DSP 
and arithmetic algorithms. The resulting systolic network
is usually (M-1)-dimensional (> N-1). By doing
so, the computation time, which can be defined as the 
time interval between loading the first input and 
unloading the last output of a problem instance 
into/from the array, and its order of complexity can be 
significantly improved while keeping the number of 
processing cells (which is the silicon chip area complex- 
ity in many cases) constant. The price to be paid is the 
small amount of additional circuitry (usually in the 
form of adders and interconnection wires) required for 
inter-row or inter-plane communications. The multi- 
dimensional systolic network can be laid out by either 
2-D or 3-D VLSI chip [1,2]. 

Most of the currently available methods that do 
implement algorithms by high-dimensional systolic net- 
works to achieve higher parallelism are based on ad-hoc 
procedures. Devising a systematic mapping algorithm for
multi-dimensional networks is not an easy task, and it
can be an NP-complete problem because of the many
diverse factors and constraints controlling the mapping
process. However, it has the benefits of reducing design 
time and producing efficient mappings. Moreover, it can 
also be incorporated into an integrated CAD tool (an 
array compiler [12], for example) for automated array 
design. A systematic method to transform and map a 
class of algorithms to high-dimensional network, called 
STAMS (Systematic Transformation of Algorithms for 
Multi-dimensional Systolic arrays), is presented in this 
paper. The STAMS technique is presented in Section 3.
Two application examples to illustrate this transforma- 
tion technique are presented in Sections 4 and 5. 


3. STAMS:
SYSTEMATIC TRANSFORMATION OF ALGORITHMS FOR
MULTI-DIMENSIONAL SYSTOLIC ARRAYS


The class of algorithms considered for the
STAMS technique is especially common in many
arithmetic and DSP applications. It is of the form

$$y_r = \sum_{i=0}^{k-1} f(w,i) * g(x,r,i) * \cdots \qquad (1)$$

where "*" indicates multiplication, f(w,i) represents a
function f with a variable w and an index i of w (i can
form the subscript or the power of w), and g(x,r,i)
represents a function g with a variable x and indices r
and i associated with x (r and i can form the subscript
or the power of x). The index r is also the subscript of
y. Examples of such algorithms are:

(1) Matrix-vector multiplication: $y_r = \sum_{i=0}^{k-1} a_{r,i+1} * b_{i+1}$

(2) 1-D convolution: $y_r = \sum_{i=0}^{k-1} w_{i+1} * x_{r+i}$

(3) k-point Discrete Fourier Transform (DFT): $y_r = \sum_{i=0}^{k-1} x_i * w^{r*i}$

(4) k-tap finite impulse response (FIR) filter: $y_r = \sum_{i=0}^{k-1} w_i * x_{r-i}$


Computation of these algorithms is conventionally
carried out by 1-D systolic arrays of k cells, as shown in
Fig.1. These 1-D systolic arrays can be obtained by
many of the linear mapping procedures listed in Section 2.
The 1-D systolic array realizations of the matrix-vector
multiplication, the 1-D convolution, the k-point DFT
and the k-tap FIR filter problems are given in [12,19],
[12,20], [21] and [21], respectively. The STAMS technique to
obtain the corresponding multi-dimensional systolic
arrays is described in the next three subsections.


Fig. 1 A 1-D systolic array of k cells


3.1 Derivation of 2-D Arrays using STAMS 


In Eq.(1), if k is not a prime number, it can
be expressed as a product of two integers p and q (i.e. k
= pq). Let ij = q*i+j and rt = p*r+t; Eq.(1) can
then be rewritten (transformed) as

$$y_{rt} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} f(w,ij) * g(x,rt,ij) * \cdots \qquad (2)$$

The index space is hence increased. If the original
problem using Eq.(1) requires the computation of $y_r$, r =
0,1,...,k-1 (i.e. $y_{rt}$, rt = 0,1,...,k-1), the same problem
using Eq.(2) will require the computation of $y_{rt}$ for r
= 0,1,...,q-1, and for each r, t = 0,1,...,p-1. Different
mathematical expressions of Eq.(2) can be exploited to
select a suitable or efficient expression for
implementation. A step-by-step sequential algorithm for computing
Eq.(2) can then be developed and mapped onto a 2-D
systolic array, usually of p rows, each with q cells, as
shown in Fig.2a. The position of each cell is indicated
by ij (ith row and jth column, starting from 0), and its
corresponding position in the 1-D array is indicated by
q*i+j. In the 2-D array, computations in the rows
and/or the columns can be carried out in parallel to
improve computation speed. The 2-D array consists of
p*q (=k) cells and therefore the area complexity is not
increased compared to that of the 1-D array, except
that a small amount of inter-row communication
circuitry is added.


Fig.2a A 2-D systolic array of k (= p*q) cells
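
The index split ij = q*i+j can be checked with a few lines of
Python. This is only an illustrative sketch: the function names are
hypothetical, and the terms of Eq.(1) are assumed to be supplied as a
plain list of k numbers.

```python
def split_sum_2d(terms, p, q):
    """Evaluate sum(terms) as the double sum of Eq.(2), where the
    term with 1-D index q*i + j is handled by cell (i, j)."""
    assert len(terms) == p * q
    # Each row i accumulates its q terms; rows can work in parallel.
    row_sums = [sum(terms[q * i + j] for j in range(q)) for i in range(p)]
    # The p row results are then added together.
    return sum(row_sums)

def cell_positions(p, q):
    """Map each 2-D cell position (i, j) to its 1-D position q*i + j."""
    return {(i, j): q * i + j for i in range(p) for j in range(q)}
```

With k = 12, p = 3 and q = 4, every 1-D position 0..11 is hit exactly
once, so the double sum equals the original single sum.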


3.2 Derivation of 3-D Arrays using STAMS 


If k can be expressed as a product of several
integers, the algorithm can be directly mapped
onto a higher-dimensional systolic network. For
example, if k = p*q*s, let ijm = q*s*i+s*j+m and rtu =
q*p*r+p*t+u; Eq.(1) can then be transformed to

$$y_{rtu} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \sum_{m=0}^{s-1} f(w,ijm) * g(x,rtu,ijm) * \cdots \qquad (3)$$

The index-space is thus increased further. If the
original problem using Eq.(1) requires the computation of
$y_r$, r = 0,1,...,k-1 (i.e. $y_{rtu}$, rtu = 0,1,...,k-1), the same
problem using Eq.(3) will require the computation of
$y_{rtu}$ for r = 0,1,...,s-1, and for each r, t = 0,1,...,q-1,
and for each t, u = 0,1,...,p-1. Different mathematical
expressions of Eq.(3) can be exploited to select a
suitable or efficient expression for implementation.
A step-by-step sequential algorithm for computing Eq.(3)
can then be developed and mapped onto a 3-D systolic
array of p planes, each with q rows of s cells each, as
shown in Fig.2b. The position of each cell is indicated
by ijm (ith plane, jth row and mth column) with its
corresponding position in the 1-D array indicated by
q*s*i+s*j+m. Higher speed can be achieved by parallel
computations in the planes and in the rows. The 3-D
array consists of p*q*s (=k) cells and therefore the
area complexity is not increased compared to that of
the 1-D array.


Fig. 2b A 3-D systolic array of k (=p*q*s) cells
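
The composite indices ijm and rtu are mixed-radix numbers, so the
mapping between a cell position and its 1-D position can be written
once for any number of factors. The helper names below are
hypothetical; the sketch assumes the factors are given in the same
order as the digits (most significant first).

```python
def to_linear(digits, radices):
    """Mixed-radix index: digits (i, j, m) with radices (p, q, s)
    give q*s*i + s*j + m."""
    index = 0
    for digit, radix in zip(digits, radices):
        assert 0 <= digit < radix
        index = index * radix + digit
    return index

def from_linear(index, radices):
    """Inverse of to_linear: recover the digit tuple from a 1-D index."""
    digits = []
    for radix in reversed(radices):
        index, digit = divmod(index, radix)
        digits.append(digit)
    return tuple(reversed(digits))
```

For the output index, `to_linear((r, t, u), (s, q, p))` gives
q*p*r + p*t + u, matching the definition of rtu.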


3.3 Other Considerations 


4-D or higher-D systolic networks can also be
obtained by similar extensions of the STAMS technique,
further increasing the index space dimension.
In general, a systolic network of higher dimension
produces higher computation speed at the expense of more
communication circuitry. Moreover, the number of pro- 
cessing cells will not be a good measure of area com- 
plexity since laying out a 4-D or higher-D systolic net- 
work on 2-D or 3-D VLSI will have an area complexity 
higher than the number of processing cells due to addi- 
tional interconnections required. An optimal trade off 
among area, time and layout complexity is thus neces- 
sary to achieve an efficient network. 


Besides these, different values of p, q, s, ..., can
also be used to achieve the best trade-off. At the
implementation level, efficient techniques for laying out
multi-dimensional systolic arrays onto 2-D or 3-D VLSI
chips must be devised so that the length of
interconnection wires, propagation delays, and synchronization
problems can be kept to a minimum.


Combining the STAMS technique just discussed
with a standard mapping methodology (for example,
the canonical mapping method developed by S.Y.Kung
[11,12]), a complete procedure for mapping algorithms
(of the form of Eq.(1)) onto multi-dimensional systolic arrays
can be developed. An example is shown briefly in
Fig.3. The entire procedure is systematic and can be
incorporated into an integrated software CAD package
(array compiler) for user-friendly interactive design.


[Flowchart]
1. Is k non-prime? If not, and approximation is allowed,
   set k = nearest non-prime number.
2. Set k = p*q*... and transform the algorithm.
3. Select a suitable mathematical expression.
4. Obtain the signal flow graph (SFG):
   * processor assignment
   * scheduling
5. Systolize the SFG to a systolic array.
6. Optimized? If not, select a different dimension, or
   different values of p, q, ..., etc. If so, proceed to
   processor design and hardware implementation.


Fig. 3 Procedure for mapping algorithms onto multi-dimensional systolic arrays
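
The first steps of the procedure in Fig.3 can be sketched in Python as
follows. The function names are hypothetical, and rounding k upward to
the next non-prime number is an assumption made here for
illustration; the figure only says "nearest".

```python
from math import isqrt

def is_composite(k):
    """True if k can be written as p*q with p, q >= 2."""
    return k > 3 and any(k % d == 0 for d in range(2, isqrt(k) + 1))

def choose_k(k, approximation_allowed=True):
    """Return a usable non-prime problem size (assumed: round upward)."""
    if is_composite(k):
        return k
    if not approximation_allowed:
        raise ValueError("k has no non-trivial factorization")
    while not is_composite(k):
        k += 1
    return k

def factor_pairs(k):
    """All ordered ways of writing k = p*q with p, q >= 2."""
    return [(p, k // p) for p in range(2, k) if k % p == 0]
```

Each pair returned by `factor_pairs` is a candidate (p, q) for the
"select different values of p, q" loop of the procedure.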


In this paper, the STAMS technique is applied to 
the matrix-vector multiplication problem and the k- 
point DFT problem. It can also be applied to many 
other problems such as the 1-D convolution problem 
and the FIR filter problem. In all cases, computational 
times are improved without increasing the number of 
processing cells. 


4. MATRIX-VECTOR MULTIPLICATION 


The matrix-vector multiplication problem A * b
for matrix A and vector b is defined as follows:

Given $b_1, b_2, \ldots, b_k$ and
$a_{11}, a_{12}, \ldots, a_{1k}, a_{21}, a_{22}, \ldots, a_{kk}$,

compute $y_1, y_2, \ldots, y_k$ defined by

$$y_r = \sum_{i=0}^{k-1} a_{r,i+1} * b_{i+1} \qquad (4)$$

The above algorithm has a 2-dimensional index space (r
and i). A linear 1-D systolic network for implementing
matrix-vector multiplication can be produced as given
in Fig.4 [12]. All $y_r$'s are initialized to zero and all $b_i$'s
are preloaded into the cells. The a inputs are skewed as
shown in the figure. Computations are pipelined and
the results start to appear at the output of cell $b_k$, k
cycles after the first input, followed by a new output
every cycle. The last result appears 2*k-1 cycles after
the first input, giving a total computation time of
2*k-1 cycles.


Fig. 4 1-D systolic array for matrix-vector multiplication
and its cell definition ($y_{out} \leftarrow y_{in} + a_{in} * b_i$)
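
A cycle-by-cycle software model can make the timing of the 1-D array
concrete. The sketch below is an assumption-laden illustration, not
the hardware of Fig.4: it uses 0-based indices, models the skewed
a inputs by feeding row r into cell i at cycle r+i, and the function
name is hypothetical.

```python
def systolic_matvec_1d(A, b):
    """Simulate k cells, each doing y_out = y_in + a_in * b_i per cycle.
    Returns (y, cycles)."""
    k = len(b)
    pipe = [0] * k            # pipe[i]: partial sum leaving cell i last cycle
    y = [None] * k
    cycles = 0
    for t in range(2 * k - 1):
        new_pipe = [0] * k
        for i in range(k):
            r = t - i                              # row currently in cell i
            y_in = pipe[i - 1] if i > 0 else 0     # y values enter as zeros
            a_in = A[r][i] if 0 <= r < k else 0    # skewed, zero-padded input
            new_pipe[i] = y_in + a_in * b[i]
        pipe = new_pipe
        cycles += 1
        r_out = t - (k - 1)                        # row leaving the last cell
        if 0 <= r_out < k:
            y[r_out] = pipe[k - 1]
    return y, cycles
```

In this model the first result leaves the last cell in cycle k and the
last one in cycle 2*k-1, in line with the counts quoted above.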


4.1 Derivation of 2-D Array using STAMS 

If k is not a prime number and can be expressed as
k = p*q, using the STAMS technique, we let ij = q*i+j
and rt = p*r+t and transform Eq.(4) to

$$y_{rt} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} a_{rt,ij+1} * b_{ij+1} \qquad (5)$$

It now has a 3-dimensional index space (rt, i, and j).
The result $y_{rt}$ can be computed by the following
algorithm:

(1) Compute $y_{rti} = \sum_{j=0}^{q-1} a_{rt,ij+1} * b_{ij+1}$,
    which is the matrix-vector multiplication of size q.

(2) Compute $y_{rt} = \sum_{i=0}^{p-1} y_{rti}$,
    which is the sum of (1) for i = 0,1,...,p-1.

Each $y_{rti}$ can be computed by a linear 1-D systolic
array, the same as that of Fig.4 except for its smaller size q.
The results are then summed by adders to produce
$y_{rt}$. Thus a 2-D systolic array consisting of p rows, each
with q columns, as shown in Fig.5, is obtained for
executing the algorithm. The position of each cell is given
by ij (ith row and jth column, starting from 0).
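
Steps (1) and (2) can be mirrored directly in software. The following
Python sketch is illustrative only: it uses 0-based list indices, so
a_{rt,ij+1} * b_{ij+1} becomes A[rt][q*i+j] * b[q*i+j], and the
function name is hypothetical.

```python
def matvec_2d(A, b, p, q):
    """Compute y = A*b through the two-step form of Eq.(5)."""
    k = p * q
    assert len(b) == k and all(len(row) == k for row in A)
    y = []
    for rt in range(k):
        # Step (1): row i of the array forms a size-q partial product sum.
        partial = [sum(A[rt][q * i + j] * b[q * i + j] for j in range(q))
                   for i in range(p)]
        # Step (2): the adders combine the p partial sums.
        y.append(sum(partial))
    return y
```

The result matches an ordinary matrix-vector product for any
factorization k = p*q.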


Computations in the rows are carried out in paral- 
lel to improve computation speed. The first result will 
appear q cycles after the first input. The last result 
will appear q+k-1 cycles after the first input, giving a 
computation time of q+k-1 cycles. Compared to the 
original 1-D array of Fig.4, the 2-D array consists of the 
same number of cells (k cells), i.e. same area complexity 
of O(k), whereas the computation time is improved by 
(k-q) cycles. The price paid is the small amount of area
and time overhead of the additional adders. To 
improve the efficiency further, efficient multi-operand 
carry save adders, for instance, can be used. We are 
currently in the process of laying out such a 1-D and 
the corresponding 2-D systolic array for matrix-vector 
multiplication using NORA CMOS bit-parallel logic 
structure and CMOS p-well process technology. 


Fig.5 2-D systolic array for matrix-vector multiplication


4.2 Derivation of 3-D Array using STAMS 


If k = p*q*s, using the STAMS technique, we let ijm
= q*s*i + s*j + m and rtu = q*p*r + p*t + u, and
transform Eq.(4) to

$$y_{rtu} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \sum_{m=0}^{s-1} a_{rtu,ijm+1} * b_{ijm+1} \qquad (6)$$

This has a 4-dimensional index space (rtu, i, j, and m).
The result $y_{rtu}$ can be computed by the following
algorithm:

(1) Compute $y_{rtuij} = \sum_{m=0}^{s-1} a_{rtu,ijm+1} * b_{ijm+1}$,
    which is the matrix-vector multiplication of size s.

(2) Compute $y_{rtui} = \sum_{j=0}^{q-1} y_{rtuij}$,
    which is the sum of (1) for j = 0,1,...,q-1.

(3) Compute $y_{rtu} = \sum_{i=0}^{p-1} y_{rtui}$,
    which is the sum of (2) for i = 0,1,...,p-1.

Each $y_{rtuij}$ can be computed by a 1-D systolic array of
size s. The results are summed by adders in parallel to
produce the $y_{rtui}$'s, which are further summed to produce
$y_{rtu}$. Thus a 3-D systolic array of p planes, each with q
rows of s cells each, as shown in Fig.6, can be obtained
to execute the algorithm. The position of each cell is
given by ijm (ith plane, jth row and mth column).

Fig.6 3-D systolic array for matrix-vector multiplication


Computations in the planes and the rows are car- 
ried out in parallel. The summing processes in the 
planes are also carried out in parallel. These parallel- 
isms improve the computation speed. The resulting 
outputs start to appear s cycles after the first input. 
The last output appears s+k-1 cycles after the first 
input, giving a computation time of s+k-1 cycles. 
Compared to the original 1-D array of Fig.4, the 3-D 
array has the same area complexity (k cells) of O(k) 
with computation time improvement of (k-s) cycles. 
This is also faster than the 2-D array case, in general, 
with more communication circuitry required. 
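
The cycle counts quoted for the three matrix-vector arrays can be
tabulated with a small helper. This is a sketch of the stated
formulas only (2*k-1 for the 1-D array, and the innermost row length
plus k-1 for the 2-D and 3-D arrays); the function name and the
`innermost` parameter are hypothetical.

```python
def matvec_cycles(k, innermost=None):
    """Computation time in cycles for matrix-vector multiplication.
    innermost=None : 1-D array of k cells           -> 2*k - 1
    innermost=q    : 2-D array with rows of q cells -> q + k - 1
    innermost=s    : 3-D array with rows of s cells -> s + k - 1"""
    return 2 * k - 1 if innermost is None else innermost + k - 1

k, p, q, s = 24, 3, 4, 2          # 24 = 6*4 (2-D case) = 3*4*2 (3-D case)
one_d = matvec_cycles(k)
two_d = matvec_cycles(k, innermost=q)
three_d = matvec_cycles(k, innermost=s)
```

For k = 24 this gives 47, 27 and 25 cycles, i.e. savings of k-q = 20
and k-s = 22 cycles over the 1-D array.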


4.3 Derivation of Higher-D Arrays using STAMS 


Higher-D systolic networks can be similarly
obtained by the application of the STAMS technique,
which will give further improvements in computation 
time at the expense of more communication circuitry. 
The final structure of the network can be the intercon- 
nections of modules of 3-D or higher-D array. How- 
ever, laying out a higher-D systolic network on 2-D or 
3-D VLSI will give an area complexity higher than O(k) 
due to additional interconnections, which may be 
undesirable. 


5. DISCRETE FOURIER TRANSFORM 


The k-point Discrete Fourier Transform (DFT)
problem is defined as follows:

Given $x_0, x_1, \ldots, x_{k-1}$,

compute $y_0, y_1, \ldots, y_{k-1}$ defined by

$$y_r = \sum_{i=0}^{k-1} x_i * w^{r*i} \qquad (7)$$

where w is a kth root of unity.

The k-point DFT can be viewed as evaluating
the polynomial

$$x_{k-1} * w^{k-1} + x_{k-2} * w^{k-2} + \cdots + x_1 * w + x_0$$

by Horner's rule:

$$(\ldots((x_{k-1} * w + x_{k-2}) * w + x_{k-3}) * w + \cdots + x_1) * w + x_0$$

The computations of $y_0, y_1, y_2, \ldots, y_{k-1}$ are carried out
using the above formula with w replaced by $1, w,
w^2, \ldots, w^{k-1}$ respectively. A linear 1-D systolic network
to implement the k-point DFT is shown in Fig.7 [21]. The
1-D network consists of k-1 cells (area complexity =
O(k)). Computations are pipelined and the results
start to appear k-1 cycles after the first input, followed
by a new output every cycle. The total computation
time is 2*k-2 cycles or O(k).


Fig. 7 1-D systolic array for k-point DFT and its cell definition
($w_{out} \leftarrow w_{in}$; $y_{out} \leftarrow y_{in} * w_{in} + x$)
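
The Horner formulation can be checked numerically against the direct
sum of Eq.(7). In the sketch below the choice w = exp(-2πi/k) is an
assumption (the text only requires a kth root of unity), and the
function names are hypothetical.

```python
import cmath

def dft_direct(x):
    """Eq.(7): y_r = sum_i x_i * w^(r*i)."""
    k = len(x)
    w = cmath.exp(-2j * cmath.pi / k)     # assumed primitive kth root
    return [sum(x[i] * w ** (r * i) for i in range(k)) for r in range(k)]

def dft_horner(x):
    """Evaluate the same polynomial by Horner's rule at 1, w, ..., w^(k-1)."""
    k = len(x)
    w = cmath.exp(-2j * cmath.pi / k)
    y = []
    for r in range(k):
        point = w ** r
        acc = 0
        for coeff in reversed(x):         # x[k-1] first, x[0] last
            acc = acc * point + coeff     # one cell: y_out = y_in * w + x
        y.append(acc)
    return y
```

Each pass of the inner loop corresponds to one multiply-add of the
kind performed by a cell of the 1-D array.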


5.1 Derivation of 2-D Array using STAMS 
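If k = p*q, using the STAMS technique, we let
ij = q*i+j and rt = p*r+t and transform Eq.(7) into

$$y_{rt} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} x_{ij} * w^{(p*r+t)*(q*i+j)}$$

Different mathematical expressions of the above equation
can be exploited to produce Eq.(8) for implementation,
as shown below:

$$\begin{aligned}
y_{rt} &= \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} x_{ij} * w^{(p*r+t)*(q*i+j)} \\
       &= \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} x_{ij} * w^{p*r*j} * w^{t*(q*i+j)} \qquad (\text{since } w^{p*q} = w^k = 1) \\
       &= \sum_{j=0}^{q-1} \sum_{i=0}^{p-1} x_{ij} * w^{p*r*j} * w^{t*j} * w^{q*t*i} \\
       &= \sum_{j=0}^{q-1} w^{p*r*j} * \sum_{i=0}^{p-1} x_{ij} * w^{t*j} * w^{q*t*i} \\
       &= \sum_{j=0}^{q-1} w^{p*r*j} * \Big( w^{t*j} * \Big( \sum_{i=0}^{p-1} x_{ij} * w^{q*t*i} \Big) \Big) \qquad (8)
\end{aligned}$$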


The result $y_{rt}$ can be computed by the following
algorithm, expressed informally as:

(1) Compute $v_{tj} = \sum_{i=0}^{p-1} x_{ij} * w^{q*t*i}$; a p-point DFT.

(2) Compute $z_{tj} = w^{t*j} * v_{tj}$; a multiplication.

(3) Compute $y_{rt} = \sum_{j=0}^{q-1} z_{tj} * w^{p*r*j}$; a q-point DFT.

A 2-D systolic array of p rows and q columns is thus
obtained to compute the $y_{rt}$'s. Each column corresponds to
a linear 1-D systolic array of size p, with a structure
similar to that of Fig.7, for computing the p-point DFT.
The multiplication is achieved through multipliers.
Registers are used for storing the resulting $z_{tj}$'s. Each
row of the 2-D network corresponds to a linear 1-D
systolic array of size q, with a structure similar to that of
Fig.7, for computing the q-point DFT.

The resulting 2-D systolic array of p rows and q
columns is shown in Fig.8. All communications are
local. The position of each cell is given by ij (ith row
and jth column, starting from 0). The cell definition is
given in Fig.9. The operation of the cell is controlled
by the control input $C_{in}$. It can perform the same
function as that of the 1-D array cell in both the vertical
and horizontal directions. Initially all $C_{in}$'s are set to
1, and the DFTs of the columns are first computed in
parallel by the left half of the cells, which gives the
results $v_{tj}$. Multiplications by $w^{t*j}$ are
then performed at the bottom row and the results are
fed back and stored in the right half of the cells.
Finally, all $C_{in}$'s are set to 0 and the DFTs of the rows
are computed in parallel to obtain the $y_{rt}$'s.

Fig. 8 2-D systolic array for k-point DFT


The 2-D array consists of p*q=k cells with area
complexity O(k), which is the same as that of the original
1-D array. However, the computation time has
been significantly improved. The first results $y_{0t}$'s
appear after 2*p+q+1 cycles, an improvement of
k-(2*p+q)-2 cycles. More importantly, total computation
time is improved to 2*(p+q) cycles, which is O(p) or
O(q), whichever is larger. Compared to the O(k) (=
O(p*q)) computation time of the 1-D array, this is a
significant improvement. If $p = q = k^{1/2}$, computation
time is improved from O(k) to $O(k^{1/2})$.

Fig. 9 Cell definition for 2-D systolic array implementation
of k-point DFT


5.2 Derivation of 3-D Array using STAMS 


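If k = p*q*s, using the STAMS technique, we let ijm
= q*s*i + s*j + m and rtu = q*p*r + p*t + u, and
transform Eq.(7) to

$$y_{rtu} = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \sum_{m=0}^{s-1} x_{ijm}\, w^{(q p r + p t + u)(q s i + s j + m)}$$

This can be further expressed as:

$$\begin{aligned}
y_{rtu} &= \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \sum_{m=0}^{s-1} x_{ijm}\, w^{p q r m + p s t j + p t m + q s u i + s u j + u m} \qquad (\text{since } w^{p q s} = w^{k} = 1) \\
        &= \sum_{m=0}^{s-1} \sum_{j=0}^{q-1} \sum_{i=0}^{p-1} x_{ijm}\, w^{p q r m}\, w^{p t m}\, w^{u m}\, w^{p s t j}\, w^{s u j}\, w^{q s u i} \\
        &= \sum_{m=0}^{s-1} w^{p q r m} \Bigl( w^{p t m}\, w^{u m} \Bigl( \sum_{j=0}^{q-1} w^{p s t j} \Bigl( w^{s u j} \Bigl( \sum_{i=0}^{p-1} x_{ijm}\, w^{q s u i} \Bigr) \Bigr) \Bigr) \Bigr) \qquad (9)
\end{aligned}$$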


The result $y_{rtu}$ can be computed by the following
algorithm expressed informally as:

(1) Compute $a_{ujm} = \sum_{i=0}^{p-1} x_{ijm}\, w^{q s u i}$; a p-point DFT.

(2) Compute $b_{ujm} = w^{s u j}\, a_{ujm}$; a multiplication.

(3) Compute $v_{tum} = \sum_{j=0}^{q-1} b_{ujm}\, w^{p s t j}$; a q-point DFT.

(4) Compute $z_{tum} = w^{p t m}\, w^{u m}\, v_{tum}$; a multiplication.

(5) Compute $y_{rtu} = \sum_{m=0}^{s-1} z_{tum}\, w^{p q r m}$; an s-point DFT.
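
The five steps can likewise be checked against a direct DFT. This is only a sequential sketch under the same assumption as before (w = exp(-2*pi*i/k)); the names `dft_direct` and `dft_three_stage` are illustrative and do not model the cell-level timing of the array.

```python
import cmath

def dft_direct(x):
    """Direct k-point DFT: y[n] = sum_m x[m] * w**(n*m), w = exp(-2*pi*i/k)."""
    k = len(x)
    w = cmath.exp(-2j * cmath.pi / k)
    return [sum(x[m] * w ** (n * m) for m in range(k)) for n in range(k)]

def dft_three_stage(x, p, q, s):
    """k-point DFT (k = p*q*s) computed with steps (1)-(5) of the 3-D derivation."""
    k = p * q * s
    assert len(x) == k
    w = cmath.exp(-2j * cmath.pi / k)
    y = [0j] * k
    for u in range(p):
        # Steps (1) and (2): p-point DFT over i, then multiply by w**(s*u*j).
        b = [[w ** (s * u * j) *
              sum(x[q * s * i + s * j + m] * w ** (q * s * u * i) for i in range(p))
              for m in range(s)] for j in range(q)]
        for t in range(q):
            # Steps (3) and (4): q-point DFT over j, then multiply by
            # w**(p*t*m) * w**(u*m).
            z = [w ** (p * t * m) * w ** (u * m) *
                 sum(b[j][m] * w ** (p * s * t * j) for j in range(q))
                 for m in range(s)]
            for r in range(s):
                # Step (5): s-point DFT over m; output index is q*p*r + p*t + u.
                y[q * p * r + p * t + u] = sum(
                    z[m] * w ** (p * q * r * m) for m in range(s))
    return y
```

For example, a 12-point input with p=2, q=3, s=2 should reproduce the direct DFT up to rounding error.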


Similar to the 2-D systolic array shown, a 3-D systolic
array of p planes, each with q rows and s columns, can
thus be obtained to compute the $y_{rtu}$'s. The 3-D array
consists of several 2-D arrays with a structure similar to that
of Fig.8, for computing the p-point DFTs, the q-point
DFTs, the s-point DFTs and the multiplications.

The resulting 3-D systolic array of p planes, q rows
and s columns is shown in Fig.10a, Fig.10b and Fig.10c.
All communications are local. The position of each cell
is given by ijm (ith plane, jth row and mth column,
starting from 0). The cell definition is given in Fig.11.
The operations of each cell are controlled by the control
inputs $C1_{in} C2_{in}$. Each cell can perform the same
function as that of the 1-D array cell in all three
directions. The operation of the array is a similar
extension of the 2-D case. Initially, all $C1_{in} C2_{in}$'s are set
to 01, and the DFTs along the direction perpendicular to
the planes are first computed in parallel by the left half
of the cells, which gives the results $a_{ujm}$'s. Multiplications
by $w^{s u j}$ to form the $b_{ujm}$'s are then performed and
the results are fed back and stored in the right half of
the cells. These are shown in Fig.10b. The $C1_{in} C2_{in}$'s are
then set to 10 and the DFTs of the columns are
computed in parallel by the right half of the cells to
give the $v_{tum}$'s. Multiplications by $w^{p t m}$ and $w^{u m}$ to
form the $z_{tum}$'s are then performed at the top of the network
and the results are fed back and stored in the left
half of the cells. Finally, the $C1_{in} C2_{in}$'s are set to 11, and the
DFTs along the rows are computed in parallel by the
left half of the cells to obtain the $y_{rtu}$'s. These are shown in
Fig.10c.


Fig. 10a 3-D systolic array for k-point DFT 
(showing the structure) 


The 3-D array consists of p*q*s=k cells with area
complexity O(k), which is the same as that of the 1-D
array. However, the computation time has been
significantly improved. The first parallel outputs $y_{0tu}$'s
appear after 2*p+2*q+s+2 cycles, an improvement of
k-(2*p+2*q+s)-3 cycles. More importantly, total computation
time is improved to 2*(p+q+s)+1 cycles,
which is O(p), or O(q), or O(s), whichever is the largest.
Compared to the O(k) (=O(p*q*s)) computation time
of the 1-D array, this is a significant improvement. If
$p = q = s = k^{1/3}$, the computation time complexity is
improved from O(k) to $O(k^{1/3})$. This is also faster than
implementing the algorithm by a 2-D network, but more
communication circuitry is required.

Fig. 10b 3-D systolic array for k-point DFT
(showing for T = 0 to 2*p+1 cycles)

Fig. 10c 3-D systolic array for k-point DFT
(showing for T = 2*p+2 to 2*(p+q+s)+1 cycles)
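
The cycle counts quoted for the 1-D, 2-D and 3-D arrays can be tabulated side by side. The sketch below simply encodes the formulas stated in the text; the function names are hypothetical helpers, not part of the STAMS technique.

```python
def cycles_1d(k):
    """1-D array: first output after k-1 cycles, 2*k-2 cycles in total."""
    return {"first_output": k - 1, "total": 2 * k - 2}

def cycles_2d(p, q):
    """2-D array of p rows and q columns (k = p*q)."""
    return {"first_output": 2 * p + q + 1, "total": 2 * (p + q)}

def cycles_3d(p, q, s):
    """3-D array of p planes, q rows and s columns (k = p*q*s)."""
    return {"first_output": 2 * p + 2 * q + s + 2, "total": 2 * (p + q + s) + 1}

if __name__ == "__main__":
    # k = 64 realised as 64, 8*8 and 4*4*4.
    for label, counts in [("1-D", cycles_1d(64)),
                          ("2-D", cycles_2d(8, 8)),
                          ("3-D", cycles_3d(4, 4, 4))]:
        print(label, counts)
```

For k = 64 the total drops from 126 cycles (1-D) to 32 cycles (2-D, p = q = 8) and 25 cycles (3-D, p = q = s = 4), in line with the stated O(k), O(k^(1/2)) and O(k^(1/3)) behaviour.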


5.3 Derivation of Higher-D Arrays using STAMS 


4-D or higher-D systolic networks can be similarly
obtained by the application of the STAMS technique,
which gives further improvements in computation time
at the expense of more communication circuitry. However,
laying out these networks on 2-D or 3-D VLSI will
give area complexities higher than O(k) due to additional
interconnections, which may be undesirable.


Fig. 11 Cell definition for 3-D systolic array
implementation of k-point DFT
(Cell storage is denoted by x when C1_in C2_in = 01 and by z
otherwise. C1_in C2_in = 00: no operation. For the other control
values the cell performs a multiply-accumulate step of the form
y_out <- y_in * w_in + z along the selected direction (with x in
place of z when C1_in C2_in = 01), passes w_in and the control
bits C1_in C2_in to the corresponding outputs, and performs no
operation on other signals.)


6. CONCLUSIONS 


In general, there is no single interconnection
topology that is optimal for all algorithms; the best choice
depends on several factors such as the application, the data
flow and the available layout technology. Having a systematic
mapping for algorithms onto multi-dimensional arrays is an
important asset for an efficient implementation and can
be developed as part of a CAD tool. The STAMS technique
presented in this paper produces systolic arrays
with significant improvements in computation time and
its order of complexity while keeping the number of
processing cells constant.


The technique is useful in many arithmetic and 
DSP applications. Two examples, the matrix-vector 
multiplication problem and the k-point DFT problem, 
have been given to demonstrate the proposed transfor- 
mation and mapping procedure. Significant improve- 
ments in computation time are achieved. This pro- 
cedure can also be applied to many other algorithms 
such as the 1-D convolution and the FIR filter. 


Even though better computation time can be produced
by increasing the dimension of the network
beyond 3-D, the additional circuitry required for inter-plane
communications and the length of the interconnection
wires due to layout on a lower dimensional
VLSI are also increased. An optimal trade-off (the
method of which is beyond the scope of this paper)
among area, time and layout complexity will be helpful
in producing reliable and efficient implementations.


A HIGHLY EFFICIENT DESIGN FOR RECONFIGURING
THE PROCESSOR ARRAY IN VLSI


Hee Yong Youn and Adit D. Singh 
Department of Electrical and Computer Engineering 
University of Massachusetts 
Amherst, MA 01003 


Abstract — Wafer Scale Integration of processor arrays for 
the parallel computation offers important advantages, specifi- 
cally high performance, low power consumption and high relia- 
bility. However, low yield due to the large silicon area is a major 
problem that remains to be solved. This paper presents a highly 
efficient design for reconfiguring both the rectangular and tree 
architecture of processor array when a significant number of 
processors in the host array are faulty. The proposed scheme 
always allows the reconfiguration of the maximum size of array 
with short maximum reconfigured edge length. It also works 
consistently well even for clustered faulty processors. Compar- 
isons of the proposed design with others in the literature reveal 
that the improvements are quite substantial. The reconfigura- 
tion overhead is also found to be very small. 


1 Introduction 


Parallel processing using processor arrays is being widely 
investigated to overcome the performance limitations of tradi- 
tional uniprocessor computer systems. Some inherent problems 
with the traditional board level implementation of these sys- 
tems are separate packaging cost, assembly cost on the printed 
circuit boards, and low reliability due to the complex pin to 
pin interconnections on the boards. While all these problems 
are important, especially significant is the large signal propaga- 
tion delay in MOS VLSI technology required to drive signals off 
chip. Wafer Scale Integration (WSI) [1] promises a solution to
this problem by integrating the entire processor array and the 
interconnection structure on a single wafer. Thus WSI makes 
it possible to eliminate the off chip signal drivers within the 
processor chips, and the complex board level interconnections 
among the processors. As a result, signal delays can be sub- 
stantially reduced. In addition, system reliability may also be 
improved because of elimination of the mechanical and electri- 
cal failures frequently observed at the pins and interconnections 
in traditional designs. 


Although WSI has many attractive features, the low yield 
problem due to the large chip area [2] must be overcome before 
such circuits can become practical. In conventional VLSI de- 
signs, the entire circuit is discarded if it contains even a single 
defect that is capable of causing a logical fault. For large area 
circuits, which have a high likelihood of containing at least one 
defect, this leads to extremely low yield. To overcome this prob- 
lem, an on-chip fault tolerance scheme, employing redundant 
components and a reconfigurable interconnection structure is 
required. Such a scheme can allow proper operation even in 
the presence of defects. This will increase the yield of ‘good’
circuits at the manufacturing stage, and can perhaps also be
used to increase the reliability of the system in the operating
stage.


A number of fault tolerance schemes for VLSI processor
arrays [3-9] have been proposed by other researchers. The
objective is to reconfigure the failure free processors in the
physical array into a desired specific logical computational topology
to best match a given parallel algorithm. The effectiveness of
such a fault tolerance scheme is generally evaluated on the
following three criteria.


• Processor utilization - defined by the ratio of the number
of processors actually utilized as the computing nodes in
the reconfigured array, to the total number of failure free
processors actually realized in the physical array. When
each processor is relatively large, almost all the chip area
is taken up by processors. Then this ratio also reflects the
chip area utilization. Because the cost of a VLSI circuit is
significantly influenced by the chip area, this factor
evaluates the cost effectiveness of the fault tolerance scheme.


• Maximum reconfigured edge length - defined by the
maximum distance between any pair of two communicating
nodes after the reconfiguration. This factor limits the
execution speed of the system, and is particularly critical
in systolic designs where the processors operate in tight
synchronization and the system clock must be slowed
to accommodate the longest delay. Since the most
important benefit obtained from wafer scale integration is
perhaps increased performance (because of the elimination
of the off chip delay), the maximum reconfigured
edge length should be as short as possible.


• Reconfiguration overhead - defined by the overhead
required for the reconfiguration such as channel width,
number of switches and their complexity.


In this paper, we present a new fault tolerance scheme 
which enables the efficient embedding of two important compu- 
tational topologies, namely the rectangular array and complete 
binary tree on a host array of processors. The proposed scheme 
maximizes the utilization of the failure free processors in the 
host array with short maximum reconfigured edge length. It 
also works consistently well even when the faulty processors 
in the host array are severely clustered. The reconfiguration 
overhead is found to be small considering the high efficiency. 


The rest of the paper is organized as follows. In Section
2, we present a scheme for reconfiguring the rectangular array
when a significant number of processors in the host array are
faulty. Also, the interconnection structure of the host array
realizing the reconfiguration is presented. The proposed scheme
is compared with other designs in Section 3. Section 4 shows
how our scheme can also be used to reconfigure the complete
binary tree architecture. Section 5 concludes the paper.


2 Design for Rectangular Arrays 


In this section, a reconfiguration scheme for embedding 
the rectangular array on the 2-dimensional processor array with 
faulty elements is proposed. Also, the interconnection structure 
of the host array of processors which realizes the reconfiguration 
is presented. 


2.1 Reconfiguration Scheme 


In reconfiguring a rectangular array, let us assume that the host
array is an N x N rectangular array and that a rectangular array
with equal sides is to be reconfigured. The reconfiguration
scheme is now presented for two cases of the number of
faulty processors (F) in the host array: i) F ≤ 2N - 1, ii) F
≥ 2N.


2.1.1 Case 1: F ≤ 2N - 1. When the number of faulty
processors (F) in the host array is not greater than 2N - 1,
at most an (N - 1) x (N - 1) processor array can be reconfigured,
because there exist at least N^2 - (2N - 1) = (N - 1)^2
failure free processors. To obtain the desired (N - 1) x (N - 1)
rectangular array with optimum maximum restructured edge
length, the bipartite matching [10] algorithm is employed. For
the matching, an (N - 1) x (N - 1) logical grid is regarded as
being overlaid on the original host array. Figure 1 shows
a 4 x 4 logical grid overlaid on a 5 x 5 host processor array.
It is also assumed that only the four surrounding processors of
each grid point can be matched to it. The maximum bipartite
graph matching is then sought between the (N - 1)^2 logical
grid points and their neighboring failure free processors in the
host array. When the complete matching is accomplished, the
(N - 1)^2 failure free processors in the host array are assigned
the logical indices. The desired (N - 1) x (N - 1) processor array
is then obtained by realizing the interconnections
of the logically neighboring processors through the interconnection
circuitry. The interconnection structure of the host array
which realizes the logical reconfiguration will be presented in
Section 2.2.
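
The matching step can be sketched as follows. This is an illustration only: it uses a simple augmenting-path matching (often called Kuhn's algorithm) rather than the faster algorithm cited as [10], it assumes that logical grid point (a, b) sits at the corner shared by processors (a, b), (a, b+1), (a+1, b) and (a+1, b+1), and the function name `match_grid_to_processors` is hypothetical.

```python
def match_grid_to_processors(n, faulty):
    """Match each point of the (n-1) x (n-1) logical grid to one of its four
    surrounding failure free processors in an n x n host array.

    faulty: set of (row, col) positions of faulty processors.
    Returns {grid_point: processor} for a complete matching, else None.
    """
    grid_points = [(a, b) for a in range(n - 1) for b in range(n - 1)]

    def neighbours(point):
        a, b = point
        cells = [(a, b), (a, b + 1), (a + 1, b), (a + 1, b + 1)]
        return [c for c in cells if c not in faulty]

    owner = {}  # processor -> grid point currently matched to it

    def try_assign(point, visited):
        # Depth-first search for an augmenting path starting at `point`.
        for proc in neighbours(point):
            if proc in visited:
                continue
            visited.add(proc)
            if proc not in owner or try_assign(owner[proc], visited):
                owner[proc] = point
                return True
        return False

    for point in grid_points:
        if not try_assign(point, set()):
            return None  # no complete matching exists
    return {point: proc for proc, point in owner.items()}
```

When the function returns `None`, the assignment borrowing step described later in this section would be needed; that step is not modelled here.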


The best known bipartite matching algorithm finds the
matching in $O(|V|^{1/2} \cdot |E|)$ time. Here |V| and |E| are the
numbers of vertices and edges in the bipartite matching graph.
Because each processor in the host array can be matched to four
surrounding logical points and each logical point has four
processors assignable to it, the matching is highly flexible.
Consequently, it is highly likely that a complete match can be
achieved. Computer simulations reveal that the likelihood of
the complete matching (for reconfiguring an 8 x 8 array) is
98.7% when 10 processors are faulty in the 9 x 9 host array. It
is 93.8% and 28% when 13 and 17 processors are faulty,
respectively. Figure 2 demonstrates a 4 x 4 array reconfigured
out of a 5 x 5 host array where 9 processors are faulty. In the
figure, the boxes marked as ‘X’ denote the faulty processors.
The two digit number (ij) in each box indicates the row (i) and
column (j) of the logical node to which the processor is matched.
Note that the maximum reconfigured edge length is very short
and bounded by the length of one side of a processor when
the complete matching is possible, because the logically
neighboring nodes are matched to failure free processors which
are physically neighboring.


The complete matching is not possible when all four
processors surrounding a logical grid point are faulty, or the faulty
processors are clustered in such a way that not all logical grid
points can be assigned even though there exist enough
failure free processors, as shown in Figure 3(a). Three logical
grid points - (2,2), (2,3) and (2,4) - are not matched to a
failure free processor due to such fault clustering. Observe that
three failure free processors in the fifth column of processors in
the host array remain unused. When complete matching
is not possible, the desired size of rectangular array can
be reconfigured by increasing the maximum reconfigured edge
length to 2 (twice the length of one side of a processor).
Here, the unmatched grid point can be matched to a failure
free processor by borrowing it from a neighboring logical grid
point which was matched to a failure free processor. The
neighboring logical grid point is now required to borrow another one
to make up for the processor lent to its neighbor. This process of
borrowing is propagated to a grid point which can be
matched to an unused failure free processor. We call this process
assignment borrowing. When multiple assignment borrowings
are required due to multiple unmatched logical grid points,
the borrowing paths (horizontal and vertical) between the
logical grid points and the unused failure free processors are
selected with the constraint that they do not cross each other.
Note that the number of unused failure free processors is always
greater than or equal to that of the unmatched grid points.
Then, it is heuristically clear that such non crossing paths
between these two sets of points always exist. Figure 3(b) shows
the assignment borrowings for the three unmatched grid points in
Figure 3(a). Observe that the borrowing paths always pass
through physically neighboring processors and the faulty
processors which do not need to be connected. This ensures that
the assignment borrowing can always be realized by a fixed
interconnection structure which will be presented in Section 2.2.
A reconfigured 4 x 4 processor array is shown in Figure 3(c).


As can be seen in the examples, the proposed scheme for 
reconfiguring the rectangular array based on bipartite match- 
ing and assignment borrowing always enables us to reconfigure 
the maximum size of rectangular array with short and bounded 
maximum reconfigured edges when the number of faulty pro- 
cessors in the host array does not exceed 2N — 1. 


2.1.2 Case 2: F > 2N - 1. When the number of faulty
processors (F) exceeds 2N - 1, the maximum size of array that
can be reconfigured from the host array can be easily seen to
be M x M, where M = $\lfloor \sqrt{N^2 - F} \rfloor$. The desired M x M
rectangular array is reconfigured by overlaying M
rows and columns of logical grid appropriately on the N x N
host array and applying the same bipartite matching and
assignment borrowing algorithm as for Case 1. Recall that
an N x N array is regarded as containing N - 1 rows and
columns of logical grid points. Therefore, selecting M rows and
columns of logical grid points to be matched out of the (N - 1) rows
and columns of logical grid points in the host array is
equivalent to selecting (N - 1) - M rows and columns to be
excluded. The rows and columns to be excluded from the matching
are determined by the number of faulty processors along
each row and column of logical grid points. Thus, for
each column of logical grid points, the faulty processors in the
columns of processors to the left and right of it are counted.
Similarly, for each row, the number is obtained by scanning
both the upper and lower rows of processors. Then, the row or
column of logical grid points with the largest number of faulty
processors is excluded first. The faulty processors along
the excluded row or column are now assumed to be failure free
because the faults have already been reflected by the exclusion.
Next, for each row and column except the excluded one, the
number of faulty processors is counted again and the row or
column with the largest number is excluded. This procedure is
repeated until (N - 1) - M rows and columns of logical grid
points are excluded in the array. Figure 4(a) and (b) show such
exclusions when 16 processors are faulty in a 5 x 5 host array.
In this example, the largest reconfigurable rectangular array is
3 x 3, and one row and one column are required to be excluded.
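
The exclusion procedure can be sketched as a greedy loop. Several details below are assumptions rather than statements of the text: grid row g is taken to lie between processor rows g and g+1 (and likewise for columns), ties are broken in favour of the first candidate examined, and once the quota of (N - 1) - M rows (or columns) is reached only columns (or rows) remain candidates. The name `choose_exclusions` is hypothetical.

```python
import math

def choose_exclusions(n, faulty):
    """Greedy choice of logical grid rows/columns to exclude (Case 2 sketch).

    n: side of the n x n host array.
    faulty: set of (row, col) positions of faulty processors.
    Returns (m, excluded_rows, excluded_cols) with m the reconfigurable side.
    """
    m = min(n - 1, math.isqrt(n * n - len(faulty)))
    quota = (n - 1) - m          # rows to exclude, and columns to exclude
    remaining = set(faulty)      # faults not yet accounted for by an exclusion
    excluded_rows, excluded_cols = [], []
    while len(excluded_rows) < quota or len(excluded_cols) < quota:
        candidates = []
        if len(excluded_rows) < quota:
            for g in range(n - 1):
                if g not in excluded_rows:
                    count = sum(1 for (r, c) in remaining if r in (g, g + 1))
                    candidates.append((count, "row", g))
        if len(excluded_cols) < quota:
            for g in range(n - 1):
                if g not in excluded_cols:
                    count = sum(1 for (r, c) in remaining if c in (g, g + 1))
                    candidates.append((count, "col", g))
        count, kind, g = max(candidates, key=lambda item: item[0])
        if kind == "row":
            excluded_rows.append(g)
            # Faults beside the excluded grid row are treated as failure free.
            remaining = {(r, c) for (r, c) in remaining if r not in (g, g + 1)}
        else:
            excluded_cols.append(g)
            remaining = {(r, c) for (r, c) in remaining if c not in (g, g + 1)}
    return m, sorted(excluded_rows), sorted(excluded_cols)
```

The remaining M x M grid points would then be passed to the matching and assignment borrowing steps.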


After the exclusions, the bipartite matching is applied
between the selected M x M grid points and their neighboring
failure free processors. Also, the assignment borrowing is
employed if the complete matching is not possible. Recall that
the assignment borrowings are always possible as long as at least
as many unused failure free processors as unmatched logical grid
points reside in the host array. This condition has already
been guaranteed by excluding (N - 1) - M rows and columns
of logical grid points in the matching. A 3 x 3 rectangular
array being reconfigured through the matching and borrowing
algorithm, after the exclusion of Figure 4(a) and (b), is
illustrated in Figure 4(c) and (d). Here the logical grid point (33)
is matched to a failure free processor by assignment borrowing
through the logical grid points (23) and (13). The maximum
reconfigured edge length, when some exclusions are necessary,
is bounded by the maximum number of consecutive excluded
rows or columns of logical grid points.


Another example of reconfiguration, where the faults are
severely clustered, is shown in Figure 5. Here all 16 processors in
the bottom half of the 5 x 5 host array are faulty. This example
shows that the proposed scheme allows us to reconfigure
the maximum size of rectangular array even for such extreme
clustering of faulty processors.


We have discussed the reconfiguration of the rectangular 
array when some processors in the host array are faulty. As 
shown in both cases of considerations, the proposed reconfig- 
uration scheme can always reconfigure the maximum size of 
array. The efficiency of the proposed scheme is not influenced 
by the distribution of the faulty processors and this is one of 
the most important characteristics of the proposed design. 


Next we present the interconnection structure of the host 
array which realizes the desired rectangular array. 


2.2 Structure of Host Array 


Each processor in the rectangular array requires four ports for
communication in the four directions North (N), East (E),
South (S) and West (W). We put one port at each corner of the
processor block as shown in Figure 6. Figure 6 shows the
structure of a processor block where an interconnection bus (dotted
line) is implemented around it. Recall that a processor can
be matched to one of the four neighboring logical grid points
located at the upper-right, upper-left, lower-right and lower-left of
it. Figure 6(a) shows the interconnection pattern when a
processor is matched to the logical grid point at the upper-right. The
other three types of interconnection pattern are illustrated in
Figure 6(b), (c) and (d).


The actual interconnection for reconfiguring the rectangular
array is achieved in two steps. First, at each processor site
which has been matched to a logical grid point, the
appropriate interconnection pattern is realized according to the
type of matching. Then, the logically neighboring processors
are interconnected with each other by connecting the two proper ends
of the connections made in the previous step. As shown
in Figure 6, a channel width of two is enough to realize almost
all kinds of interconnection patterns. One extra bus is required
only when two physically and horizontally (vertically)
neighboring processors are matched to two vertically (horizontally)
neighboring logical grid points between them, as shown in
Figure 7. The two physically and horizontally
neighboring nodes which are matched to the logical grid points (31)
and (41), or (14) and (24), in Figure 2 are examples of those
requiring an extra vertical bus between processors. However,
the matching which requires an extra bus occurs only when
the fault distribution does not allow us to avoid such a
pattern. Therefore, a channel width of three is always sufficient
for realizing all interconnections. Figures 8-11 show the actual
interconnections realized for the reconfigured rectangular
arrays shown in Figures 2-5, respectively.


In the next section, we compare the efficiency of our design 
with other classical designs presented in the literature. 


3 Comparisons with Other Designs 


In this section, the proposed scheme for reconfiguring the 
rectangular array on the 2-dimensional processor array is eval- 
uated and compared with the other classical designs such as 
hierarchical scheme[5], column redundant scheme[6] and fault 
stealing scheme[7]. These schemes are compared on the pro- 
cessor utilization, maximum reconfigured edge length and the 
reconfiguration overhead. 


3.1 Comparison of Processor Utilization 


In carrying out the comparison, the objective is to obtain a
computational array of fixed desired size (M x M). Therefore,
for each scheme, the optimum size of host array which gives
the best processor utilization is found for the given processor
yield (P; the probability that each processor is failure free) from 0.4
to 0.9 in steps of 0.1.


3.1.1 Proposed Scheme. As discussed in the previous
section, the proposed scheme can always reconfigure the desired
size (M x M) of array whenever at least M^2 processors in the
host array are failure free. Therefore, the yield of an M x M
array out of an R x C host array is readily seen to be

$$\text{yield} = \sum_{i=M^2}^{R \times C} \binom{R \times C}{i} P^{i} (1 - P)^{R \times C - i} \qquad (1)$$

The processor utilization (PU) is then obtained by

$$PU = \frac{M^2 \times \text{yield}}{R \times C \times P} \qquad (2)$$

The optimum size of host array, here R and C, which gives
the best processor utilization (PU) for the given P (processor
yield) is found using equations (1) and (2).
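
Equations (1) and (2) can be evaluated directly. The following is a minimal sketch; the exhaustive search over host sizes and its `max_side` bound are assumptions made for illustration, since the search procedure itself is not described here, and all function names are hypothetical.

```python
from math import comb

def array_yield(m, rows, cols, p):
    """Eq.(1): probability that at least m*m of rows*cols processors are good."""
    total = rows * cols
    return sum(comb(total, i) * p ** i * (1 - p) ** (total - i)
               for i in range(m * m, total + 1))

def processor_utilization(m, rows, cols, p):
    """Eq.(2): PU = m^2 * yield / (rows * cols * p)."""
    return m * m * array_yield(m, rows, cols, p) / (rows * cols * p)

def best_host_size(m, p, max_side):
    """Search host sizes with m <= rows <= cols <= max_side for the best PU."""
    best = None
    for rows in range(m, max_side + 1):
        for cols in range(rows, max_side + 1):
            pu = processor_utilization(m, rows, cols, p)
            if best is None or pu > best[0]:
                best = (pu, rows, cols)
    return best
```

For instance, `best_host_size(8, 0.7, 20)` would return the utilization and host dimensions found within a 20 x 20 search window for an 8 x 8 target.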


3.1.2 Hierarchical Scheme. This scheme uses redundant
submodules [5] for extracting the desired rectangular array.
The objective is to ensure, through redundancy, a very
high probability that a failure free 2 x 2 submodule can be
reconfigured at each submodule site, so that row and column
exclusion [4], employed at a higher level to protect against such
failures, is very rarely needed. Here we need to find the
optimum size of submodule (R x C) that guarantees a 2 x 2 failure
free processor array. The yield and PU can be shown to be
given by

$$\text{yield} = \left( \sum_{i=4}^{R \times C} \binom{R \times C}{i} P^{i} (1 - P)^{R \times C - i} \right)^{(M/2)^2} \qquad (3)$$

$$PU = \frac{4 \times \text{yield}}{R \times C \times P} \qquad (4)$$

3.1.3 Column Redundant Scheme. An M x C (C >
M) array is used for the reconfiguration of an M x M rectangular
array in this scheme. A failure free linear array of size M
is reconfigured out of C processors in each row, and then the
desired rectangular array is finally obtained by connecting the
M failure free linear arrays vertically through the interconnection
channels between each row of processors. The yield and
PU can be seen to be given by

$$\text{yield} = \left( \sum_{i=M}^{C} \binom{C}{i} P^{i} (1 - P)^{C - i} \right)^{M} \qquad (5)$$

$$PU = \frac{M \times \text{yield}}{C \times P} \qquad (6)$$

Similarly, the optimum size of column (C) is found using
these two equations.
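
As a rough illustration of how the optimum C might be located from Eqs.(5) and (6), the sketch below scans candidate column counts up to a hypothetical bound `max_cols`; this exhaustive scan is an assumption, not a description of the procedure actually used, and the function names are illustrative.

```python
from math import comb

def row_yield(m, c, p):
    """Probability that at least m of the c processors in one row are good."""
    return sum(comb(c, i) * p ** i * (1 - p) ** (c - i) for i in range(m, c + 1))

def column_redundant(m, c, p):
    """Eqs.(5) and (6): yield and PU of an m x c host for an m x m target."""
    y = row_yield(m, c, p) ** m
    return y, m * y / (c * p)

def best_column_count(m, p, max_cols):
    """Scan c = m .. max_cols and return (PU, c) with the highest PU."""
    best = None
    for c in range(m, max_cols + 1):
        _, pu = column_redundant(m, c, p)
        if best is None or pu > best[0]:
            best = (pu, c)
    return best
```

Calling `best_column_count(8, 0.7, 40)` gives the column count with the highest utilization in that range, which can be compared against the entries of Table I.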


3.1.4 Comparisons. Table I(a) and (b) show the
size of host array which gives the best processor utilization for
reconfiguring a desired array of size 8 x 8 and 16 x 16,
respectively. Observe that the size of the host array of the proposed
scheme is always much smaller than that of the other schemes.
Therefore, it can be expected that the processor utilization is
much better. Figure 12(a) and (b) show the plots of the
processor utilization for both sizes of array, respectively. As expected,
the processor utilization of the proposed scheme is much
better than that of other designs for the whole range of processor
yield and the size of array. It is about 30% and 40% more
efficient than the column redundant scheme and the hierarchical
scheme, respectively. Because the proposed scheme always
reconfigures the maximum size of array, the processor utilization
can be said to be near optimal.


The fault stealing schemes proposed in [7] borrow failure
free processors from the upper row of processors to replace the
faulty processors which cannot be replaced by the redundant
processors in the same row. Simulation data from [7]
demonstrate that the yield for reconfiguring a 20 x 20 processor array
out of a 21 x 21 array is less than 10% when more than 30
processors are faulty. Recall that the yield of the proposed scheme is
always 100% as long as enough processors are failure
free (400 in this example).


3.2 Comparison of Maximum Edge Length 


As we can see from Table I, the proposed scheme requires the
smallest size of the host array to reconfigure a desired size of
rectangular array. This means that the maximum reconfigured
edge length of our design is smaller than that of other designs
because each processor, which is matched to a logical grid point
(overlaid on the host array regularly), is the physical neighbor
to it and thus the processors can be regarded as laid out regularly on
the host array. Note that the maximum edge length of a
rectangular array in a fixed area is minimal when nodes are equally
separated from each other. Also recall that the borrowing
process does not affect the reconfigured edge length substantially
because the borrowing occurs between two neighboring grid
points.


3.3 Comparison of Reconfiguration Overhead 


For the proposed scheme, a channel width of three is enough
for reconfiguring the rectangular array irrespective of the
number of faulty elements in the host array and their distribution.
Note that the processor utilization of the other schemes was
obtained with the assumption that sufficient channel width was
provided for the reconfiguration. For example, from Table I,
the column redundant scheme requires an 8 x 18 array to
reconfigure an 8 x 8 array most efficiently when processor yield
is 0.7. Here, a channel width of at most 10 is required to
realize all patterns of the reconfiguration. If the channel width
is limited to three, then the yield and the processor utilization
decrease significantly.


Also, it can be argued that the channel width of three is 
relatively small considering the high efficiency of the proposed 
scheme. Even the simple bypassing scheme[4] requires one bus 
around each processor (equivalent to a channel width of two) 
even though its efficiency is known to be very poor. It can also 
be expected that the efficiency of the proposed scheme will not
degrade significantly even with a channel width restricted to two.


From the comparisons, we can see that the proposed scheme 
can always reconfigure the desired size of rectangular array 
from the smaller host array with high yield (efficiency) and 
short maximum reconfigured edge length. Also the reconfigu- 
ration overhead, measured in terms of channel width, is rela- 
tively small. We next present a highly efficient fault tolerant 
tree embedding scheme that uses the same reconfiguration al- 
gorithm and interconnection structure that we have presented 
for the rectangular array. 


4 Design for Tree Architecture 
A near optimal scheme for embedding a complete binary tree
in an array of failure free processors with planar interconnections
was proposed in [12], which substantially improves the
efficiency of [13]. This scheme adopts a hierarchical strategy 
such that any required size of tree larger than four levels is 
embedded by connecting an appropriate number and type of 
basic modules shown in Figure 13. Observe that all nodes at 
adjacent levels (except for two nodes in basic module M2) are 
physical neighbors and can be connected with short links. 15 
out of the 16 processors in the basic module are utilized as the 
tree nodes in each four level leaf subtree. The remaining unused 
processor in each basic module is used as a tree node at some 
higher level, when the basic modules are connected together to 
build a larger tree. Figure 14 shows a 9 level tree embedded 
using the basic modules. Because exactly one processor is
left unutilized in the implementation for any size of tree, the
area efficiency quickly converges to 100% as the size of the
embedded tree grows large. The propagation delay between the
root node and a leaf node is also very short: it converges to the
theoretical lower bound as the size of the tree grows large, as
proven in [12].
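
As a rough illustration of this convergence, the sketch below assumes
that a k level complete binary tree (2^k - 1 nodes) is embedded in an
array of exactly 2^k processors, so that one processor is left over as
stated above. The function name area_efficiency is ours and does not
come from [12].

```python
def area_efficiency(levels):
    """Fraction of processors used as tree nodes for a `levels`-level tree.

    Assumption: the array holds exactly one processor more than the
    tree has nodes, as described in the text for trees of four or more levels.
    """
    nodes = 2 ** levels - 1        # nodes in a complete binary tree
    processors = nodes + 1         # one processor remains unused
    return nodes / processors


for k in (4, 9, 12):
    print(k, area_efficiency(k))
```

For the four level basic module this gives 15/16, and for the 9 level
tree of Figure 14 it gives 511/512.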


4.2 Fault Tolerant Tree Embedding 


A tree architecture can be efficiently embedded in a processor
array with faulty nodes when the reconfiguration scheme proposed
for the rectangular array is combined with the tree embedding
scheme on failure free processors introduced in the previous
subsection. As shown in Figure 14, the tree architecture
is constructed using a number of basic modules, each a 4 x 4
processor array containing a 4 level subtree. Here, we propose
to construct a tree of the desired size by reconfiguring each basic
module of a 4 level tree efficiently and then connecting the
modules appropriately. To obtain the desired tree architecture,
first, a 4 x 4 processor array is reconfigured from the host
submodule using the scheme proposed for the rectangular array.
Then, the interconnection pattern for each type of basic module
is realized in each submodule. Figure 15 shows an example in
which a basic module of type 2 (M2) is embedded in a 5 x 5
processor array of a host submodule, assuming the same fault
distribution and reconfiguration as in Figure 2. In the figure, for
every processor in the host submodule, the I/O port for the
connection to the parent node is located at the upper-left corner
of the processor site. The two ports for the children are located at
the lower-left and upper-right corners, respectively. Note that
these positions need to be rotated together by 90°, 180° or 270°
according to the orientation of the basic module inside the
host array to prevent the interconnections from crossing over each
other. For instance, the basic module M1 at the upper-left corner
of the host array of Figure 14 has its I/O ports rotated by 180°,
so that the port for the parent node is located at the lower-right
corner of the processor module. The hierarchical scheme uses
submodules with different port orientations because some
interconnections would cross over each other if the three I/O port
positions were the same for all the processors in the host array.


4.3 Comparison of Efficiency 


Let Y_k denote the yield of a k level tree. Then the yield and
PU, using the same notation as in the previous section, can
be shown to be given by
The optimum size of submodule which gives the best PU
is found from the above equations. Figure 16 compares the
processor utilization for embedding an eight level tree with that
of two other designs, presented in [14] and [15], respectively.
The row exclusion scheme [14] excludes the entire row
of processors containing a faulty processor, where the host
array is the CHiP [9] architecture. In the modular scheme [15],
the whole tree architecture is constructed from modules, each of
which contains a spare processor to replace a faulty processor in
that module. The figure shows that the improvement
is quite significant. The efficiency of the proposed scheme has
also been found to be better than that of other designs such
as the SOFT [16] and Cluster Proof [17] designs. The maximum
reconfigured edge length and the reconfiguration overhead are
expected to be smaller than those of other designs due to the high
processor utilization and the compact layout of the processor
arrays.


5 Conclusion 


A highly efficient design for reconfiguring the rectangular 
array and the binary tree architecture in the presence of a sig- 
nificant number of faults is presented. By employing bipartite 
graph matching with an assignment borrowing algorithm, the 
proposed scheme always allows reconfiguration of the maximum 
possible size of array. Also the maximum reconfigured edge is 
inherently short. The proposed scheme can reconfigure the de- 
sired structure successfully even when the faulty processors in 
the host array are severely clustered as might be realistically 
expected. 


A heuristic strategy which excludes some rows and columns
of logical grid points on the host array is suggested. Also, multiple
paths for the assignment borrowing algorithm are suggested
based on planarity considerations. Reconfiguring other impor- 
tant computational topologies using the algorithms proposed 
in this paper is also under investigation. 
Figure 1. Overlaying a 4 x 4 logical grid on a 5 x 5 host array.

Figure 2. Reconfiguration of a 4 x 4 array using bipartite matching.


Processor   Proposed           Hierarchical        Column redundant
yield       scheme             scheme              scheme
0.9         9 x 9 (=81)        8 x 12 (=96)        8 x 12 (=96)
0.8         9 x 10 (=90)       12 x 12 (=144)      8 x 14 (=112)
0.7         10 x 10 (=100)     12 x 12 (=144)      8 x 18 (=144)
0.6         11 x 11 (=121)     12 x 16 (=192)      8 x 21 (=164)
0.5         12 x 12 (=144)     16 x 16 (=256)      8 x 27 (=216)
0.4         13 x 14 (=182)     16 x 20 (=320)      8 x 38 (=280)

(a). For an 8 x 8 array.


Processor   Proposed           Hierarchical        Column redundant
yield       scheme             scheme              scheme
0.9         17 x 18 (=306)     24 x 24 (=576)      16 x 25 (=308)
0.8         18 x 19 (=342)     24 x 24 (=576)      16 x 27 (=432)
0.7         (illegible)        24 x 32 (=768)      16 x 33 (=528)
0.6         21 x 22 (=462)     32 x 32 (=1024)     16 x 40 (=640)
0.5         24 x 24 (=576)     32 x 40 (=1280)     16 x 50 (=800)
0.4         27 x 27 (=729)     40 x 40 (=1600)     16 x 64 (=1024)

(b). For a 16 x 16 array.


Table I. Size of host array which gives the best processor
utilization for reconfiguring a rectangular array.
(a) Incomplete matching. (b) Assignment borrowing.

Figure 3. Reconfiguration of a 4 x 4 array when the
complete matching is not possible.


(a) Exclusion of the third row of logical grid points.
(b) Exclusion of the fourth column of logical grid points.
(c) Bipartite matching. (d) Assignment borrowing.

Figure 4. Reconfiguration of a 3 x 3 array out of a 5 x 5
host array using all the failure free processors.


[...] column of logical grid points.
(c) Bipartite matching. (d) Assignment borrowing.

Figure 5. Reconfiguration of a 3 x 3 array when all the
processors in the bottom half of the host array are faulty.
(a) To upper-right. (b) To upper-left.
(c) To lower-left. (d) To lower-right.

Figure 6. Four patterns of interconnection realization.


Figure 7. Matchings requiring one extra channel.


Figure 8. Interconnection for Figure 2.


Figure 9. Interconnection for Figure 3.


Figure 10. Interconnection for Figure 4.


Figure 11. Interconnection for Figure 5.
(b) 16 x 16 array.


Figure 12. Comparison of Processor Utilization.


Figure 13. Four types of basic module.


Figure 14. 9 level tree embedding using basic modules.


Figure 15. Reconfiguration of a four level tree
for the same distribution of faults as in Figure 2.


Figure 16. Comparison of Processor Utilization
for reconfiguring an eight level tree.
Abstract-


Computation requirements for an integrated vision system are
tremendous, and thus there is a need for parallel processing. There are
several tasks which must be performed repeatedly in sequence. Each of
these tasks has great potential for spatial and temporal parallelism. In
general, the degree of exploitable parallelism is high but dynamically
variable. Therefore, efficient utilization of resources in a multiprocessor
vision system requires the system to be highly flexible and modular. In
this paper we consider an architecture for an integrated vision system.
We then illustrate how the various steps of an integrated vision
system, which consist of low level, high level and hybrid algorithms, can
be efficiently mapped in an integrated environment. In particular, we
consider a stereo vision algorithm that extracts a 3-D object description
from a set of 2-D images. The emphasis is on using a small number of
powerful processors concentrated in clusters and connected via a flexible,
reconfigurable and programmable crossbar. The issues considered are
mapping algorithms independently of problem size, minimizing
communication, efficient pipelining of tasks, and load balancing to evenly
distribute the computation. We argue why the architecture is efficient as
an integrated vision system. Furthermore, we show how the various steps
of the algorithm can be mapped onto the architecture, with a brief
description of each step.


I. Introduction 


Computer vision has been regarded as a very complex problem. 
Image analysis and understanding procedures employ a very broad 
spectrum of techniques from several areas such as signal processing, 
advanced mathematics, graph theory, and artificial intelligence. These 
algorithms are, in general, characterized by massive parallelism. For low 
level processing, spatial decomposition of an image provides a natural 
way of generating parallel tasks. For higher level analysis operations, 
parallelization may also be based on other image characteristics. The 
multi-dimensional divide-and-conquer paradigm [1] is an attractive 
mechanism for providing parallelism in both of the above cases. In [2], 
Ahuja and Swamy proposed a multiprocessor pyramid architecture as a
straightforward implementation of the divide-and-conquer based
approach. Such pyramids are natural candidates for executing divide- 
and-conquer algorithms as they most closely mirror the flow of 
information in these algorithms. However, design of an integrated vision 
system requires a greater flexibility, partitionability, and reconfigurability 
than is offered by regular array or pyramid structures[3]. 


Many multiprocessor architectures and parallel algorithms have
been proposed to solve the problem of image understanding [4,5,6,7,8].
Most architectures, such as pyramids, array processors, and meshes, have
limited capabilities to implement an integrated system for image
processing, for several reasons. First, they are mostly suitable for
SIMD type algorithms, which constitute only the low level vision
operations. Second, the architectures are inflexible due to the rigid
interconnections between processors and between processors and memory.
Third, the number of processors needed to solve a problem of reasonable
size is in the hundreds or thousands. Such a large number of processors is
not only cost prohibitive, but the processors themselves cannot be very
powerful and can have only limited features due to technological
limitations.
Fourth, it is normally assumed that the problem size exactly matches the 
number of processors available. Most of the time it is not clear how to 
adapt algorithms so that problems of different sizes can be solved on the 
same number of processors efficiently. Finally, the problem of input- 
output of data is rarely addressed in any of these architectures. It is 
important to note that no matter how fast or powerful a particular 
architecture is, its utilization can be limited by the bandwidth of the I/O.
Significant research is being carried out in developing architectures and
algorithms for image processing which are practically feasible. One good
example is the CMU Warp processor [9,10,11,12]. The machine has a
programmable systolic array of linearly connected cells, each capable of
10 MFLOPS. The array can efficiently perform local operations, in which
each output depends on a small corresponding area of the input, since the
connections between the cells are neighbor connections. It is also claimed
that Warp is suited for global image operations.


An integrated vision application contains several algorithms in
sequence, with the input of each dependent on the output of the previous
algorithm and on externally supplied parameters. Some of the algorithms
are suited for SIMD architectures, and some for MIMD and systolic
architectures. Several issues, such as efficient parallel mapping of
individual algorithms, communication between tasks, data transfer, and
task scheduling, must be addressed. For example, a stereo vision
algorithm that obtains 3-D surface information from 2-D images is one
such application [13,14,15], consisting of tasks such as edge detection,
matching, Hough transform, fitting, and surface interpolation. There is
scope for considerable parallelism within each task and for pipelining
the tasks.


In this paper we consider an integrated vision architecture and 
describe its features. Then we argue why the architecture is efficient as an 
integrated vision system. Furthermore, we show how various steps of the 
algorithm can be mapped onto the architecture with a brief description of 
each step of the algorithm. 


This paper is organized as follows. Section 2 presents the 
architecture. In Section 3, the mapping of the various steps of the stereo 
vision algorithms is described. Finally, summary and a few remarks 
about future work are presented. 


II. Architecture


Figure 1 shows an architecture (called NETRA) for a large high 
performance multiprogrammed multiprocessor for image analysis and 
understanding systems. The architecture consists of the following 
components :- 


(1) A large number (100 - 10000) of Processing Elements (PEs), 
organized into clusters of, say, 16 to 64 PEs each. 

(2) A tree of Distributing-and-Scheduling-Processors (DSPs) that 
make up the task distribution and control structure of the 
multiprocessor. 

(3) A parallel pipelined shared Global Memory. 

(4) An Interconnection Network that links the PEs and DSPs to the
Global Memory.

The system is illustrated with a block diagram in Fig. 1.


A. Processor Clusters 


The clusters consist of, say, 16 to 64 PEs, each with its own 
program and data memory. They form a layer below the DSP-tree, with a 
leaf DSP associated with each cluster. PEs within a cluster also share a
common data memory. The PEs, the DSP associated with the cluster, and 
the shared memory are connected together with a crossbar switch. The 
crossbar switch permits point-to-point communications as well as 
selective broadcast by the DSP or any of the PEs. 


Clusters can operate in an SIMD mode, a systolic mode, or an
MIMD mode. Each PE is a general purpose processor with a high speed
floating point capability. In the SIMD mode, PEs in a cluster execute
identical instruction streams from private memories in a lock-step
fashion. In the systolic mode, PEs repetitively execute an instruction or
set of instructions on data streams from one or more PEs. In both cases,
communication between PEs is synchronous. In the MIMD mode, PEs
asynchronously execute instruction streams resident in their private
memories. The streams need not be identical.


B. The DSP Hierarchy 


The DSP-tree is an n-tree with nodes corresponding to DSPs and
edges to bi-directional communication links. Each DSP node is composed
of a processor, a buffer memory, and a corresponding controller.


The tree structure has two primary functions. First, it represents the
control hierarchy for the multiprocessor. A DSP serves as a controller for
the subtree structure under it. Each task starts at a node on an appropriate
level in the tree, and is recursively distributed at each level of the sub-tree
under the node. At the bottom of the tree, the sub-tasks are executed on a
processor cluster in the desired mode (SIMD or MIMD) and under the
supervision of the leaf DSP.


The second function is that of distributing the programs to leaf
DSPs and the PEs. Vision algorithms are characterized by a large number
of identical parallel processes operating on different data sets. It would be
highly wasteful if each PE issued a separate request for its copy of the
program block to the global memory, because it would result in a large
amount of unnecessary traffic through the interconnection network. Under
the DSP-hierarchy approach, one copy of the program is fetched by the
controlling DSP (the DSP at the root of the task subtree) and then
broadcast down the subtree to the selected PEs.


C. Global Memory 


The multiport global memory is a parallel-pipelined structure. 
Given a memory(chip)-access-time of T processor-cycles, each line has T 
memory modules. It accepts a request in each cycle and responds after a 
delay of T cycles. Since an L-port memory has L lines, the memory can 
support a bandwidth of L words per cycle. 
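
The bandwidth claim can be checked with a small model. The sketch
below assumes (this is our assumption, not stated above) that
consecutive requests on a line are interleaved round-robin over its T
modules, so that each module is free again exactly when its turn comes
around. The names simulate_line and words_delivered are illustrative
only.

```python
def simulate_line(access_time, cycles):
    """Count responses delivered by one memory line within `cycles` cycles."""
    busy_until = [0] * access_time          # one entry per memory module
    completed = 0
    for cycle in range(cycles):
        module = cycle % access_time        # assumed round-robin interleaving
        if busy_until[module] > cycle:
            raise RuntimeError("module still busy: request would stall")
        busy_until[module] = cycle + access_time
        if cycle + access_time <= cycles:   # response arrives T cycles later
            completed += 1
    return completed


def words_delivered(ports, access_time, cycles):
    """An L-port memory is modeled as L independent lines."""
    return ports * simulate_line(access_time, cycles)


print(words_delivered(ports=8, access_time=4, cycles=1000))
```

In this model no request ever stalls, and after the initial delay of T
cycles each line returns one word per cycle, so L lines approach L
words per cycle.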


Data and programs are organized in memory in blocks. Blocks 
correspond to "units" of data and programs. For example, in the case of a 
graph matching algorithm for a symbolic-matching task, each block may 
be a record containing all information corresponding to one node of the 
graph. The size of a block is, hence, variable and is determined by the 
size of a record for a task. A large number of blocks may together 
constitute an entire program or an entire image. Memory requests are 
made for blocks. The PEs and DSPs are connected to the Global 
Memory with a packet-switching multistage interconnection network.


The global memory is capable of queuing requests made for blocks
that have not yet been written into. Each line (or port) has a Memory-line
Controller (MLC) which maintains a list of read requests to the line and
services them when the block arrives. It maintains a table of tokens
corresponding to blocks on the line, together with their length, virtual
address and full/empty status. The MLC is also responsible for virtual
memory management functions.


Fig 1: Organization of NETRA


III. Mapping the Stereo Vision Algorithm 


Assume that I is an input to an image processing task f and the
output is O. That is,

O = f(I)

For example, f may be edge detection, filtering, Fourier transform or
object recognition. Such an f has several characteristics that favor
parallel implementation. First, an identical, data independent, local
operation is performed throughout an entire image on small quadrants or
windows. This spatial characteristic implies that an image may be divided
into a set of subimages which can be processed in parallel. Second, often
several such functions are applied in sequence to an image, for
example, stereo vision for 3-D object extraction, which uses convolution,
matching, Hough transform, fitting etc. These temporal characteristics
suggest the use of a pipeline environment to improve the processing rate.
In summary, an overall processing function can be partitioned into
several subfunctions which are pipelined, yielding the advantages of both
spatial and temporal parallelism.


Architecture and the Model 


(1) Processor Allocation : Each stage of the image processing function
can be performed on one or more clusters of processors. Spatial
parallelism within each stage of the pipeline can be exploited using the
flexible interconnect of the cluster and the local DSP for scheduling.

(2) Pipelining : Pipelining between various stages can be achieved using
the macro dataflow feature of the architecture. Once an output data block
is produced by the previous task, it is sent to the global memory with
the appropriate address of the next cluster needing this data (details are
given in [3]).

(3) Task Scheduling : Scheduling of tasks can be performed locally in a
cluster in MIMD mode by all processors sharing the load information in
the common data memory. If more than one cluster is involved, then an
appropriate DSP can schedule tasks and dynamically allocate processors
to various subtasks if the computation is heavily input data dependent and
unevenly distributed.


We now illustrate how the various algorithms that are part of stereo 
vision can be implemented in parallel on the proposed architecture. We 
discuss the computation, communication and scheduling issues along 
with the data structure and dataflow requirements. We consider various 
algorithms individually and describe their possible parallel version on 
NETRA and suggest how they can be integrated and pipelined. 


Figure 2 shows the data flow and main data structures for the 
integrated stereo vision algorithm. The figure only illustrates the 
computation and communication needs for the left input image. Exactly
the same computation and communication is also done for the right input 
image. The type of algorithm suited for each step (such as SIMD, MIMD 
or a mix of SIMD and MIMD) is also indicated with the task blocks. The 
following is a description of the tasks shown in Figure 2. 
Fig. 2 : Computation, Communication and Data flow for Stereo Vision


(1) Data Compression and Edge Detection : For coarse-to-fine
processing, the input image data must be compressed two or more
times, depending on the resolution of the image. Data compression
involves computing a weighted average: each pixel in the output
(compressed) image is the weighted average of a neighborhood of a
certain size in the original (uncompressed) image. Compression is a
variation of 2-D convolution (except that, unlike convolution, the output
is not taken at each point, but at every n-th point, where n depends on
how much compression is needed). Since both edge detection and data
compression involve a variation of 2-D convolution, we describe a
convolution algorithm. A 2-D convolution is expressed as follows:

G(i,j) = w_{i,j} * I(i,j)

where I(i,j) is the image, w_{i,j} is the convolution window and G(i,j) is
the output of the convolution operation.
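
The relation between compression and convolution can be sketched
sequentially as follows. The function compress, its argument names, and
the handling of the image border (windows that do not fit are simply
skipped) are our assumptions for illustration; the text does not
specify them.

```python
def compress(image, window, step):
    """Weighted average of each window, taken only at every `step`-th point."""
    w = len(window)
    total = sum(sum(row) for row in window)      # normalizes the weights
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r in range(0, n_rows - w + 1, step):     # skip rows: sub-sampling
        out_row = []
        for c in range(0, n_cols - w + 1, step):
            acc = 0.0
            for dr in range(w):
                for dc in range(w):
                    acc += window[dr][dc] * image[r + dr][c + dc]
            out_row.append(acc / total)
        out.append(out_row)
    return out


image = [[2.0] * 4 for _ in range(4)]
print(compress(image, [[1, 1], [1, 1]], step=2))
```

With step = 1 the same loop produces an output at every point, i.e. an
ordinary windowed weighted sum; a larger step gives the compressed image.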


Edges are used as features to be matched in the left and right images.
The matched edges then yield depth points. This algorithm uses
the zero crossings of the convolution of the image with the ∇²G operator
to determine the edge locations in the image. There are numerous parallel
algorithms available for 2-D convolution on various architectures [16].
Furthermore, there are special purpose integrated circuits available to
perform convolution. We describe a parallel algorithm for edge detection
on the processor clusters of NETRA.


The approach is to reduce 2-D convolution to a 1-D convolution
without incurring additional steps. Each pixel is logically mapped onto a
separate processor (as if there were as many processors available as there
are pixels). In practice the image is folded and multiple pixels are mapped
onto one processor. The image is folded in two dimensions in a wrap
around fashion, both left to right and top to bottom. If the image size is
n x n, and the number of processors is p x p, then each processor will
have n²/p² pixels in its local memory. In general, pixel (i,j),
0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1, will be mapped to processor ((i mod p), (j mod p)).
Therefore, this mapping preserves the adjacency of any two pixels even
though the image is folded.
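
The two properties claimed for the folded mapping (n²/p² pixels per
processor, and adjacency preserved) can be checked directly. The sketch
below assumes that p divides n and that p is at least 2; the helper
names are ours.

```python
def processor_for_pixel(i, j, p):
    """Folded (wrap around) mapping of pixel (i, j) onto a p x p cluster."""
    return (i % p, j % p)


def torus_distance(a, b, p):
    """Number of wrap-around neighbor hops between processors a and b."""
    dr = min((a[0] - b[0]) % p, (b[0] - a[0]) % p)
    dc = min((a[1] - b[1]) % p, (b[1] - a[1]) % p)
    return dr + dc


def check_folding(n, p):
    """Return pixels per processor; assert adjacent pixels stay adjacent."""
    counts = {}
    for i in range(n):
        for j in range(n):
            here = processor_for_pixel(i, j, p)
            counts[here] = counts.get(here, 0) + 1
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < n and nj < n:
                    there = processor_for_pixel(ni, nj, p)
                    assert torus_distance(here, there, p) == 1
    return counts


counts = check_folding(n=8, p=4)
print(len(counts), set(counts.values()))
```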


Assume that the window (or neighborhood) size is w x w and the
convolution mask is stored in each PE’s memory. A small window is
embedded in a larger one and therefore, the same connections can be used
for a larger window size with the addition of new connections for the
extra steps. The algorithm performs the convolution by each processor
distributing its pixel values to the neighborhood in a pipelined manner.


In the following algorithm the North, South, East and West neighbors
are defined in a wrapped around fashion. At any step all the processors
have the same neighbor connection, so all the processors follow
exactly the same pattern. It should be noted that the data values at each
processor are stored in a linear array, and subscript (i,j) means data
value i in connection number j. For a processor (i,j), the N, S, E, W
neighbors are defined as follows.

N = ((i-1), j); if (i-1) < 0, then N = ((i-1+p), j)
S = ((i+1) mod p, j)
E = (i, (j+1) mod p)
W = (i, (j-1)); if (j-1) < 0, then W = (i, (j-1+p))
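
These definitions can be written compactly in code. The sketch below
relies on the fact that Python's % operator returns a non-negative
result for a positive modulus, which covers the "if ... < 0" cases
above; the function name neighbors is ours.

```python
def neighbors(i, j, p):
    """Wrap-around N, S, E, W neighbors of processor (i, j) in a p x p cluster."""
    return {
        "N": ((i - 1) % p, j),   # (i-1+p) when i-1 < 0
        "S": ((i + 1) % p, j),
        "E": (i, (j + 1) % p),
        "W": (i, (j - 1) % p),   # (j-1+p) when j-1 < 0
    }


print(neighbors(0, 0, 4))
```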


Assuming that each processor has m pixels in its local memory, the
algorithm works as listed in ALGORITHM CONVOLUTION. For an image of
size 256 x 256 and a processor cluster of size 64, each processor has 1K
pixels. For a window size of 3 x 3, each processor performs 9K
multiply-add operations. The interconnection needs to be reconfigured
only eight times. It is important to note that the number of times the
interconnection needs to be reconfigured depends only on the
neighborhood window size. Also note that the algorithm can be easily
adapted to any problem size and any processor cluster size. Once the
convolution with the Laplacian is performed, each processor stores the
zero crossings by storing their (x,y) positions and orientations.


The above algorithm illustrates that SIMD algorithms can be
mapped efficiently onto the processor clusters using the flexibility and
programmability of the interconnection. Furthermore, the mapping is
such that the interconnection reconfiguration is independent of the input
image size. The following algorithms illustrate how MIMD and hybrid
algorithms (algorithms needing both SIMD and MIMD types of
computation, such as the Hough transform) can be mapped and how
pipelining of tasks can be achieved. It should be noted that each of these
algorithms is executed on one or more different clusters. Also, there are
two parallel stereo algorithms being executed at all times, one for the left
and one for the right input image.
(2) Feature Matching : Feature matching is required in order to compute
the depth points in the image. There are several subtasks within this task.
First, the output data of the previous step (i.e., edges) needs to be
properly organized. Second, since matching needs the corresponding edges
from the left (right) image, there is a need for data transfer between
processor clusters. Then there is the task of matching the properties of the
edges within a certain window along the epipolar lines. Epipolar lines are

ALGORITHM CONVOLUTION;

Input : IM(n,n) , Output : Matrix of zero crossings Z(n,n)
All the processors work in SIMD lock-step fashion.
Set up Connection_array of size wxw by choosing
the first wxw connections from the set
{N,E,S,S,W,W,N,N,N,E,E,E,S,S,S,W,W,W,W,N,N,N,N,E,...}.


For i = 1 to m do (in parallel)
    Result(i) := w_{i,j} * data(i)
End_For


For j = 1 to wxw do (in parallel)
    Set up appropriate connections on the crossbar as follows.
    connection(j) := Connection_array(j)
    For i = 1 to m do (in parallel)
        Send data (pixels) on the output port to
        the connected neighbor.
        At the same time receive data from its input port.
        Result(i) := Result(i) + w_{i,j} * data(i,j)
    End_For
End_For
END_CONVOLUTION;
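

The shifting scheme can be simulated sequentially. The sketch below is
a simplified model under several assumptions of ours: one pixel per
processor (m = 1, p = n), a connection sequence generated as a spiral
with run lengths 1,1,2,2,3,3,..., and the centre weight applied before
the first shift. It checks the result against a direct wrap-around
weighted sum. All function names are illustrative.

```python
from itertools import cycle

# Direction -> (row delta, column delta) on the processor torus.
DELTAS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def spiral_connections(count):
    """First `count` directions of the assumed spiral N,E,S,S,W,W,N,N,N,..."""
    seq, run, dirs = [], 1, cycle("NESW")
    while len(seq) < count:
        for _ in range(2):              # two runs share each run length
            seq.extend(next(dirs) * run)
        run += 1
    return seq[:count]


def shift_convolve(image, weights):
    """Accumulate weight * pixel while all processors shift data in lock-step."""
    p, half = len(image), len(weights) // 2
    data = [row[:] for row in image]
    result = [[weights[half][half] * image[i][j] for j in range(p)]
              for i in range(p)]
    off_r = off_c = 0                   # offset of the pixel currently held
    for d in spiral_connections(len(weights) ** 2 - 1):
        di, dj = DELTAS[d]
        # Everyone sends to its d-neighbor, so (i, j) receives from (i-di, j-dj).
        data = [[data[(i - di) % p][(j - dj) % p] for j in range(p)]
                for i in range(p)]
        off_r, off_c = off_r - di, off_c - dj
        w = weights[half + off_r][half + off_c]
        for i in range(p):
            for j in range(p):
                result[i][j] += w * data[i][j]
    return result


def direct_convolve(image, weights):
    """Reference: the same wrap-around weighted sum computed directly."""
    p, half = len(image), len(weights) // 2
    span = range(-half, half + 1)
    return [[sum(weights[half + dr][half + dc] * image[(i + dr) % p][(j + dc) % p]
                 for dr in span for dc in span)
             for j in range(p)] for i in range(p)]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
weights = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
print(shift_convolve(image, weights) == direct_convolve(image, weights))
```

For a 3 x 3 window the loop performs eight shifts, one per
reconfiguration of the interconnection, which agrees with the count
given in the text.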


normally assumed to be horizontal and therefore, the search is limited to 
one dimension. Therefore, the goal is to map data efficiently onto the 
processors such that the communication is minimized. We suggest the 
following. 


Each processor P_{i,j} accumulates rows of zero crossings in such a
way that communication is needed only in one direction without any need
to change the interconnection. Therefore, the processors are organized
into horizontal linear arrays in a wrapped around fashion as shown in
Figure 3. The data is then pumped in one direction in systolic fashion
and each processor accumulates the appropriate rows (of edges) according
to the following mapping. Processor P_{i,j} receives the rows

(i mod p) + jp + kp² for k = 0,1,...

and accumulates at most n/p² rows.
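
A small sketch of this row assignment follows. It takes the mapping to
be i + jp + kp² with 0 ≤ i, j ≤ p-1, which is our reading of the
formula; the function name rows_for_processor is illustrative.

```python
def rows_for_processor(i, j, p, n):
    """Rows of an n-row edge image accumulated by processor P(i, j)."""
    first = (i % p) + j * p            # position in the wrapped linear array
    return list(range(first, n, p * p))


p, n = 2, 8
assignment = {(i, j): rows_for_processor(i, j, p, n)
              for j in range(p) for i in range(p)}
print(assignment)
```

Under this reading every row is assigned exactly once, and the rows held
by any one processor are p² rows apart.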


The second subtask is that of data transfer between the two clusters
of processors which are working on the left and the right image
respectively. This transfer is achieved by macro data flow using the global
memory between corresponding processors in the two clusters. Note that
this can be performed at the same time as rows are being accumulated;
this is an example of how pipelining of tasks can be achieved. Once each
processor has accumulated the appropriate rows, the next subtask is to
find edges which have a match in the other image. This task is highly data
dependent because the computational needs depend on the image and on
how many edges there are to be matched. It is possible that some parts of
the image result in a lot of edges whereas other parts have relatively few
edges. Therefore, the processors that work on rows having a comparatively
large number of edges may be heavily loaded while others may be
underutilized. Hence there is a need for efficient task scheduling and
dynamic load balancing. The above mapping of rows onto processors
ensures that no processor has adjacent rows of edges, and in fact, any two
rows held by the same processor are at least p² rows apart. Thus the
mapping itself tends to evenly distribute the computation on the
processors. The following algorithm sketches how the processors perform
the task of feature matching in parallel.


Fig. 3 : Reorganization of Processors for Feature Matching


Each processor first works on the rows allocated to it initially. Then,
if it finishes before the others do, it selects rows allocated to other
processors and performs the feature matching algorithm on them. In this
way load balancing is achieved. Once the feature matching is finished, the
depth map of the image is available in both the left and the right image’s
coordinate system. The next task is to perform surface fitting.

(3) Surface Fitting : The feature matching task provides part of the
depth points in each processor’s local memory. The first subtask is to
transfer the Depth_array to each processor’s local memory. This can be
done in macro data flow mode by connecting the processors in a circular
array as before.


Once each processor has the Depth_array, surface fitting can be
done in parallel in MIMD mode. The synchronization, task scheduling
and load balancing can be achieved using the shared common data
memory, or the global memory if more than one cluster of processors is
involved in this computation. The algorithm uses the Hough transform
method to fit planes onto a set of points.


The input to the following algorithm is a two-dimensional array of
points (e.g., zero crossings) in which line segments are to be detected. We
will show how a parallel algorithm can be implemented to compute these
line segments using the Hough transform method. The computation is done
in the $(r, \theta)$ parameter space. If there exists a line whose normal distance
from the origin is $r$ and whose normal makes an angle $\theta$ with the x-axis, then
any point $(x, y)$ that lies on that line satisfies the following equation.

$$r = x\cos\theta + y\sin\theta$$

First of all $r$ and $\theta$ are quantized. The quantization depends on how much
accuracy is required in the final result. Let us assume that the maximum value


ALGORITHM FEATURE_MATCH;

Each processor $P_{i,j}$, $0 \le i \le p-1$, $0 \le j \le p-1$ (in parallel) do
    Row_count$_{i,j}$ = number of rows initially allocated to $P_{i,j}$
    repeat
        If Row_count$_{i,j}$ > 0 (more rows left in $P_{i,j}$'s local memory) Then
            Select a row(Row_count$_{i,j}$) from its local memory.
            Row_count$_{i,j}$ = Row_count$_{i,j}$ - 1.
            Mark row(Row_count$_{i,j}$) as selected in the
                common data memory of the cluster.
            Update load information in the common data memory.
            For each edge (zero crossing) in the selected row Do
                Look for a match in the corresponding
                    row in the other image
                If match found Then
                    Compute depth point $z_i$ for point $(x_i, y_i)$
                    Store in Depth_Array$(x_i, y_i, z_i)$
                Else
                    Store $(x_i, y_i)$ in the list of ambiguous edges
                End_If
            End_For
        Else
            If Row_count$_{k,l} \ne 0$ for some processor $P_{k,l}$ Then
                Select a row from the processor with maximum Row_count.
                Mark the selected row.
                Update load information in the common data memory.
            End_If
        End_If
    Until finished (i.e., no more rows left to be selected)
End FEATURE_MATCH


of $r$ is $r_{max}$ and the maximum value of $\theta$ is $\theta_{max}$ (generally $\pi$ or $2\pi$). Then if
$r_{res}$ and $\theta_{res}$ are the resolutions used for quantization, the total number of
accumulator cells in the computation is $r_{max}\theta_{max}/(r_{res}\theta_{res})$. The numbers
of rows and columns in the accumulator array are $R = \theta_{max}/\theta_{res}$ and
$C = r_{max}/r_{res}$ respectively. The mapping is as follows. Each processor
computes all $r$ values for its share of $\theta$ values. If there are $p^2$ processors
then each processor gets $n = R/p^2$ values of $\theta$ to work on. Therefore,
processor $i$ works on $n$ values of $\theta$, where $1 \le i \le p^2$. Figure 4 shows the
accumulator array for processor $i$. The main reason for such a mapping
is that when looking for peaks later no two processors need to
communicate, which reduces the communication overhead. Further,
the processor can store the sin and cos values for its allocated $n$ values of $\theta$ in its
registers, saving the memory access delays that would occur if
all quantized $\sin\theta$ and $\cos\theta$ values were stored with each processor in its
local memory. A brief explanation of the algorithm is as follows. Each
processor begins working on a small part of the image. It computes the
required $r$ values for each of the $\theta$ values stored in its registers. It then
increments the appropriate count in the Acc_array. If the count increases
beyond a certain threshold value, this cell may be a
local maximum. Therefore, another array called Link_array is updated
to mark this fact. This step reduces the search space tremendously when
looking for local maxima, since normally a very small fraction of the
image contributes to lines and the entire accumulator array need not be
searched. Figure 5 shows the Link_array.
Once the above computation is finished for the entire image, the local
maxima are computed in the Acc_array using the information available in
Link_array as follows. Only those locations in the Acc_array that are
marked in Link_array need to be searched, because it contains only those
locations which are candidates for local maxima. Therefore, the search
space is reduced.
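
The accumulation and peak search described above can be sketched as follows. This is a minimal sequential illustration, not the authors' implementation: the function names accumulate and local_maxima, the decision to ignore negative r values, the use of a Python set for the Link_array, and the assumption that R is divisible by the number of processors are all introduced here for the example. Each simulated processor precomputes sin and cos for its own n values of theta, mirroring the register storage described in the text.

```python
import math


def accumulate(points, r_max, r_res, theta_max, theta_res, num_procs, threshold):
    """Return one (acc, link) pair per simulated processor."""
    R = round(theta_max / theta_res)   # rows: quantized theta values
    C = round(r_max / r_res)           # columns: quantized r values
    n = R // num_procs                 # theta values per processor (assumes divisibility)
    out = []
    for proc in range(num_procs):
        # sin/cos of this processor's n theta values, computed once ("registers")
        trig = [(math.cos((proc * n + j) * theta_res),
                 math.sin((proc * n + j) * theta_res)) for j in range(n)]
        acc = [[0] * C for _ in range(n)]
        link = set()                   # candidate cells for local maxima
        for x, y in points:
            for j, (c, s) in enumerate(trig):
                r = x * c + y * s
                if r < 0:
                    continue           # sketch only: negative r is ignored
                col = int(r / r_res)
                if col < C:
                    acc[j][col] += 1
                    if acc[j][col] > threshold:
                        link.add((j, col))
        out.append((acc, link))
    return out


def local_maxima(acc, link, w):
    """Search only the cells recorded in link, comparing along the theta index."""
    peaks = []
    for i, j in sorted(link):
        neighbours = [acc[i + d][j] for k in range(1, w + 1) for d in (-k, k)
                      if 0 <= i + d < len(acc)]
        if all(acc[i][j] > v for v in neighbours):
            peaks.append((i, j))
    return peaks


# Ten points on the vertical line x = 5; theta = 0 and r = 5 should dominate.
pts = [(5, y) for y in range(10)]
(acc0, link0), (acc1, link1) = accumulate(pts, 16, 1, math.pi, math.pi / 4, 2, 5)
peaks0 = local_maxima(acc0, link0, 1)
```

In this example only the first simulated processor records a candidate, so the second one has nothing to search, which is the saving the Link_array is meant to provide.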


The second subtask in surface fitting is that of fitting quadratic
patches. This algorithm falls in the category of MIMD algorithms that
work on a database of information produced by the previous steps. The
output of this algorithm depends on the input data globally. Two
adjacent patches are compatible if their depth and orientation do not differ
by more than a certain threshold value. The compatibility criterion is an input
parameter to the algorithm. The planar patches at each grid point are
placed into sets of planes such that they are mutually compatible. Then a
variation of the relaxation algorithm is applied to the planes at each grid
point to test for mutual compatibility. This task can be done in parallel
in MIMD mode, using the shared memory model for synchronization and load
balancing of the subtasks. The shared memory model is preferred to
message passing because the input data size is large, so there
may be large communication requirements at each stage of the algorithm,
since a compatibility label may propagate over a large distance. Further, task
scheduling and load balancing are easier because the load information is
available centrally. Therefore, scheduling of tasks can be done easily
without having to transfer huge amounts of data from one processor to
another. Let us assume that there are a total of n grid points in the image
on which quadratic patches are to be fitted. The following steps in the
stereo vision algorithm also involve similar algorithms. Due to
limitations on space we omit the descriptions of those algorithms.
However, a brief description of what is involved in the rest of the steps is
provided.

(4) Locating Contours : Locating contours involves checking for 
discontinuities in the surfaces. This also can be done in parallel using the 
common data memory MIMD algorithm. A similar algorithm to the one 
Fig. 4: Accumulator array mapping for Hough Transform
on each Processor (Accum_array(i,j) for one processor)


ALGORITHM ACCUMULATE_COUNT
Each processor $P_i$, $1 \le i \le p^2$, does the following (in parallel)
For j = 1 to n do
    For each (x,y) in the array such that (x,y) is significant do
        compute $r(\theta_{ij}) := x\cos\theta_{ij} + y\sin\theta_{ij}$
        Acc_array($\theta_{ij}$, $r(\theta_{ij})/r_{res}$) := Acc_array($\theta_{ij}$, $r(\theta_{ij})/r_{res}$) + 1
        If Acc_array($\theta_{ij}$, $r(\theta_{ij})/r_{res}$) > threshold value then
            Link_array($\theta_{ij}$, $r(\theta_{ij})/r_{res}$) := true
        End_If
    End_For
    Transfer (x,y) value to next processor in the circular pipeline
End_For
END ACCUMULATE_COUNT


above can be used easily, with only the main body being different. Instead of
checking for compatibility at each grid point, a bipartite circular patch
is fit (in four directions) in order to detect edges due to discontinuity. If
enough processing power is available then the check for edges can be done in
more than four directions.
(5) Surface Interpolation : Surface interpolation can be done in parallel
for each point P on the surface by taking the weighted height of the point
(where the weight depends on the distance of the point from each surface
within a distance of 2w from the point). Again the algorithm can be easily
implemented in MIMD mode using the shared memory model.
Essentially, the algorithm reduces to computing a weighted average
within a three-dimensional neighborhood window for each point. Familiar
algorithms for weighted averages can be used. However, the algorithm
needs to be MIMD because, unlike the standard SIMD neighborhood
algorithms, the computation is data dependent and the neighborhood size
itself depends on the data. Therefore, the neighborhood for each point may
be of a different size depending on the number of different surfaces within
a distance 2w from the point.
(6) Computing Parameters for Next Level : This step requires 
computation opposite to that needed in data compression. In this step, 
data is expanded for the next finer level. Essentially, the grid is doubled 
ALGORITHM LOCAL_MAXIMA
Each processor does the following in MIMD mode
For each entry Link_array(i,j) do
    If (Acc_array(i,j) > Acc_array(i-k,j)) AND
       (Acc_array(i,j) > Acc_array(i+k,j))
       for all k s.t. $1 \le k \le w$ for a certain
       neighborhood of size w Then
        declare Acc_array(i,j) as a local maximum
        This gives a line with the $(r, \theta)$ parameters
    End_If
End_For
END LOCAL_MAXIMA


Fig. 5: Link_Array data structure to reduce
search space for computing maxima
(lists of nodes indicating possible local maxima)
ALGORITHM QUAD_PATCH;
Each processor $P_{i,j}$, $0 \le i \le p-1$, $0 \le j \le p-1$ (in parallel) do
    Grid_point_count$_{i,j}$ = number of grid points initially allocated to $P_{i,j}$
    repeat
        If Grid_point_count$_{i,j}$ > 0 (local Grid_points with $P_{i,j}$) Then
            Select a row(Grid_point_count$_{i,j}$) from its local memory.
            Mark Grid_points as selected in the common data memory.
            Update load information in the common data memory.
            For each selected Grid_point Do
                Check the compatibility of each plane in
                    the neighborhood with the averaged parameters
                    of each set.
                Choose the two most compatible sets and make the
                    plane a member of these two sets.
            End_For
        Else
            Check load information in the common data memory.
            If Grid_point_count$_{k,l} \ne 0$ for some $P_{k,l}$ Then
                Select a Grid_point from the processor with
                    maximum Grid_point_count.
                Mark the selected Grid_point.
                Update load information in the common data memory.
            End_If
        End_If
    Until finished (i.e., no more Grid_points left to be selected)
End QUAD_PATCH
in each direction while preserving the surfaces of the present level. Then the
quadratic patches are interpolated using the compatibility of the
neighboring existing quadratic patches. This can be accomplished by the
relaxation algorithm used previously.

Once the parameters for the next finer level are computed, the
entire algorithm (involving all the steps) is executed on a finer image and
a more accurate description of the 3-D surface is obtained. This iterative
process is continued until the finest level of the image is reached, which
provides the most accurate description of the object.

IV. Summary 


In this paper we considered a multiprocessor architecture for
integrated vision. Its processing power is concentrated in clusters of
powerful processors connected through a flexible and programmable
crossbar. Its system control functions are distributed over a hierarchy of
controllers. We considered a parallel image processing model and
discussed the architecture as well as the 3-D stereo vision algorithm in
the context of the model. We illustrated how the various steps of the
integrated system can be mapped in parallel and pipelined, and what the
computation, communication, data flow, scheduling and load balancing
requirements are. Furthermore, we argued why the proposed mapping and
integration of tasks is efficient.


We are in the process of simulating the architecture and algorithms,
both in an independent environment and in an integrated
environment, to investigate performance in the light of the various issues
discussed earlier. Furthermore, we propose to compare the performance of
various low level, high level and hybrid algorithms mapped onto
architectures such as the hypercube with their mapping onto the proposed
architecture. This performance study is also aimed at identifying other
issues to be considered for an integrated vision system architecture which
may have been overlooked, and at suggesting refinements to the
architecture.
AN OPTIMAL SOLUTION FOR CONSENSUS PROBLEM
IN AN UNRELIABLE COMMUNICATION SYSTEM

Abstract  Traditionally, the consensus problem is
solved in a fully connected network under a node failure
assumption. This paper discusses the consensus problem
under the assumption of link failure. A simple and
efficient protocol, FLINK, is proposed. The complexity of
information exchange required by the protocol is $O(n^2)$.
The protocol uses the minimum number of rounds to
achieve consensus and can tolerate the maximum number
of allowable faulty components.


1. Introduction 


To achieve agreement on a predefined value
in a distributed system, protocols are required so that
the system will run even if certain components in the
distributed system have failed. Such a unanimity
problem was studied by Lamport [7,8], and it is called
Byzantine Agreement (BA) [2,3,4,6,7,8,9,10,11]. A
closely related sub-problem, the consensus problem, has
been studied extensively in the literature [1,7]. In this
paper, we are concerned with the solution of the consensus problem.
The problem is to make the correct
nodes in an n-node fully-connected network reach a
common agreement. Each node chooses an initial value
to start with, and the nodes communicate with each other by means of
messages. A protocol solves the consensus
problem if it satisfies the following constraints:


(Agreement): All correct nodes agree on the
same value.
(Validity): If the initial value of all nodes is
$v_i$, then all correct nodes shall
agree on $v_i$.

Many results on Byzantine Agreement or the
consensus problem are based on the assumption of node
failure in a fail-safe network [1-11]. Based on this
assumption, a communication link fault is treated as a
node fault, regardless of the correctness of the innocent
node; hence an innocent node is not involved in the
agreement. This contradicts the definition of
BA (or the consensus problem), which requires all correct
nodes to achieve agreement.

In this paper, we consider a distributed system
whose nodes are reliable during the consensus execution,
while message links may be disturbed by noise or
an intruder, so that the exchanged messages may be
maliciously altered. A new efficient and reliable protocol to
achieve consensus in an unreliable communication
environment is proposed first; its efficiency and
reliability are then proved. The common term round
[1,6] is used to denote the interval of message exchange.
The proposed protocol can tolerate $\lfloor n/2 \rfloor - 1$ faulty
links, and requires only two rounds of message
exchange. The amount of necessary information
exchange is only $O(n^2)$ [4]. If a link fault is treated as a
node fault, the number of rounds required by the
protocol is better than the previous results [1,7].

In the subsequent sections, Section 2 defines the
model and the concepts. Section 3 presents the proposed
protocol and proves its correctness. Section 4 discusses
the impossible cases of an unreliable distributed system
and the optimality of the protocol. Section 5 gives the
conclusion and the future work.


2. Model 


We consider a fully connected n-node network in which each
node has at least n*n bytes of memory, a sender's
message is always identifiable by the receiver, and the
protocol's processing time is negligible. Each
node always works correctly during the execution of the
consensus protocol, but links may be damaged by
noise or an intruder; a link is faulty
when a message transferred over it is changed or delayed.
Conversely, a link is perfect when the transferred
message is always received correctly and on time.

Usually a node's computation is faster than
the message communication through a link; hence
a node's computation time for the protocol is ignored.
Under this assumption, the protocol can make the
correct nodes in a fully-connected network reach a
predefined common value with the least number of
rounds.

Let $MAT_i$ be the matrix set up at node i by the
procedure MATRIX shown in Figure 1, for i = 1, ..., n. In
the first round, node i receives the preset initial value
from each of the other nodes. Let $V_i$ be the vector
$[v_1, v_2, ..., v_k, ..., v_n]$, where $v_k$ represents the initial value
received from node k or the initial value of node i itself,
$1 \le k \le n$. For simplicity, any value not identified or not
received within the predefined time limit will be set to
the complementary value $\neg v_i$ by the receiver.


Procedure MATRIX (for node i with initial value $v_i$)

Step 1: Receive the initial value $v_j$ from node j,
        for $1 \le j \le n$ and $j \ne i$.
Step 2: Construct the vector $V_i = [v_1, v_2, ..., v_j, ..., v_n]$, $1 \le j \le n$.
Step 3: Read the vector $V_j$ from node j, $1 \le j \le n$, $j \ne i$.
Step 4: Construct $MAT_i$ (setting the vector $V_j$
        in column j, for $1 \le j \le n$).

Figure 1. The procedure for setting $MAT_i$ on node i.


In the second round, each node broadcasts its V
vector to the other nodes and receives n-1 vectors from the
other nodes. $MAT_i$ is established by using vector $V_j$ as
the j-th column in the matrix. Note that the i-th
column in $MAT_i$ is the vector constructed by node i in
the first round. In the second round, node i can receive
n-1 vectors from the other n-1 nodes; therefore $MAT_i$ is
an n*n matrix. Each element $v_{kj}$ in $MAT_i$ represents
the value of node k that node j received in the first
round. Let the majority value of $[v_{k1}, v_{k2}, ..., v_{kn}]$ in the
k-th row be $MAJ_k = v_i$ if the number of $v_i$'s is greater
than n/2; otherwise $MAJ_k$ is set to ?. Figure 2 shows an
example of six nodes with $v_i = 1$ for i = 1, 2, 3, 4 and 5,
and $v_6 = 0$. The vector $V_2$ received by node 2 in the
first round is shown in Figure 2(b), and the
corresponding $MAT_2$ is shown in Figure 2(c). The
second column in $MAT_2$ is the vector $V_2$ shown in
Figure 2(b), and the other 5 columns are the vectors
received by node 2 in the second round. If link 25 and
link 26 are faulty, the values $v_{52}$ and $v_{62}$, and the
vectors $V_5$ and $V_6$, may have been changed maliciously. The
majority value of each row is shown in Figure 2(d).


Figure 2. A network with six nodes shows the way to
construct $MAT_2$ and get the majority value of each row:
(b) the vector $V_2$ received in the first round; (c) $MAT_2$;
(d) the majority value of each row of $MAT_2$.
The two dashed lines represent two faulty links.


Based on the properties of the consensus
problem, the initial value of node i (denoted $v_i$)
should be known to the node itself prior to the execution of the
consensus protocol. When node i makes a decision after
executing the consensus protocol, it must determine
whether or not a disagreement with its initial
value exists, and it has to decide on either the initial value or a
"default" ($\phi$). In any case, a node with initial value $v_i$
should not decide on the complementary value $\neg v_i$.

As for the multivalue consensus problem, Turpin and
Coan [11] have already shown that a protocol for the
binary value consensus problem can be extended to the
multivalue consensus problem; therefore only binary
initial values are discussed.


3. Protocol 


In this section the proposed protocol, based on the
model developed in Section 2, is formally presented. The
following definitions make the protocol different from
the previous results.

Definition 1: Every correct node should always know
its own initial value; and

Definition 2: If the initial value is $v_i$, then the decision
made must be either $v_i$ or $\phi$, but not $\neg v_i$.

Figure 3 shows the protocol FLINK, which can
tolerate $\lfloor n/2 \rfloor - 1$ faulty links and achieves consensus
with only two rounds of message exchange. Later, we will
prove 1) the efficiency of the method, and 2) the
necessary and sufficient conditions for the number of
rounds and faults required by the FLINK protocol. Let
$DEC_i$ be the value chosen by node i to agree on with the
others.

Figure 4 shows the complete procedure and the
result of the protocol FLINK for the six node example
mentioned in Figure 2. Node 2 has $MAJ_6 = ?$ in
$MAT_2$, and $v_{62} = 1 = v_2$; by step 3 in FLINK, $DEC_2 = \phi$.
Nodes 1, 3, 4 and 5 find that there is a $MAJ_6 = 0$
($= \neg v_i$) in $MAT_i$, so $DEC_i = \phi$ for i = 1, 3, 4 and 5 by
step 2 in FLINK. For the same reason, node 6 has a
$MAJ_j = 1$ ($= \neg v_6$), so $DEC_6 = \phi$. Therefore, all nodes
agree on the same value $\phi$, and consensus is
achieved.


Protocol FLINK (for node i with initial value $v_i$)

Message Exchange Phase:

    Round 1: Broadcast ($v_i$), then receive the
             initial values from the other nodes,
             and construct vector $V_i$.
    Round 2: Broadcast ($V_i$), then receive the
             vectors broadcast by the other nodes
             and construct $MAT_i$.

Decision Making Phase:

    Step 1: Take the majority value of each
            row k of $MAT_i$ to be $MAJ_k$.
    Step 2: Search for any $MAJ_k$. If ($\exists\, MAJ_k = \neg v_i$),
            then $DEC_i := \phi$;
    Step 3: else if ($\exists\, MAJ_k = ?$) AND ($v_{ki} = v_i$),
            then $DEC_i := \phi$; else $DEC_i := v_i$,
            and halt.

Figure 3. The FLINK protocol to achieve consensus.
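
The two rounds and the decision rule of Figure 3 can be traced with a short simulation. This is a sketch under stated assumptions rather than the authors' implementation: the names flink, majority, PHI and UNDECIDED are introduced here, nodes are numbered from 0, and a faulty link is modeled by one particular adversary that flips every bit it carries (the paper allows arbitrary changes and delays).

```python
PHI = "phi"        # the "default" decision
UNDECIDED = "?"    # row without a strict majority


def majority(row):
    """MAJ_k: the value held by more than n/2 entries of the row, else '?'."""
    ones = sum(row)
    if 2 * ones > len(row):
        return 1
    if 2 * (len(row) - ones) > len(row):
        return 0
    return UNDECIDED


def flink(initial, faulty_links):
    """Run FLINK for binary initial values; returns the decision of every node."""
    n = len(initial)
    faulty = {frozenset(link) for link in faulty_links}

    def send(value, src, dst):
        # Assumed adversary: a faulty link flips every bit it carries.
        return 1 - value if frozenset((src, dst)) in faulty else value

    # Round 1: node i builds V_i from the initial values it receives.
    V = [[initial[k] if k == i else send(initial[k], k, i) for k in range(n)]
         for i in range(n)]

    decisions = []
    for i in range(n):
        # Round 2: column j of MAT_i is V_j as received by node i.
        cols = [V[j] if j == i else [send(x, j, i) for x in V[j]]
                for j in range(n)]
        mat = [[cols[j][k] for j in range(n)] for k in range(n)]  # mat[k][j] = v_kj
        maj = [majority(row) for row in mat]
        vi = initial[i]
        if (1 - vi) in maj:                                   # Step 2
            decisions.append(PHI)
        elif any(maj[k] == UNDECIDED and mat[k][i] == vi      # Step 3
                 for k in range(n)):
            decisions.append(PHI)
        else:
            decisions.append(vi)
    return decisions


# Six-node example of Figure 2: nodes 1..5 start with 1, node 6 with 0,
# links 2-5 and 2-6 are faulty (0-based indices below).
mixed = flink([1, 1, 1, 1, 1, 0], [(1, 4), (1, 5)])
unanimous = flink([1, 1, 1, 1, 1, 1], [(1, 4), (1, 5)])
```

Under this bit-flipping adversary every node in the mixed case decides on the default value, and in the unanimous case every node keeps the common initial value, matching the Agreement and Validity constraints for this one scenario.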


$DEC_i = \phi$ (for $MAJ_6 = 0 = \neg v_i$ and $v_i = 1$), i = 1, 3, 4, 5;
$DEC_2 = \phi$ (for $MAJ_6 = ?$ AND $v_{62} = 1 = v_2$);
$DEC_6 = \phi$ (for $MAJ_j = 1 = \neg v_6$ and $v_6 = 0$), $1 \le j \le 5$.

Figure 4. The result of FLINK for the six node
example in Figure 2 (the matrices $MAT_i$ with the majority value of each row).


The following lemmas and theorems are used to 
prove the correctness and complexity of FLINK. 

Lemma 1: If there is a $MAJ_k = \neg v_i$ in $MAT_i$,
then there is at least one node in the network with an initial value
which disagrees with $v_i$.

Proof: The majority value in the k-th row being $\neg v_i$
means that there are at least $\lceil (n+1)/2 \rceil$ $\neg v_i$'s in the k-th
row. Since the number of faulty links is at most
$\lfloor n/2 \rfloor - 1$, there exists at least one value $\neg v_i$ received
from a perfect link. In other words, there is a node
which has a disagreeing initial value. □

Lemma 2: Let the initial value of node i be $v_i$
and let link ij be perfect; then the majority value of
the i-th row in $MAT_j$ should be $v_i$.

Proof: Since link ij is perfect, node j will
receive $v_i$ from node i in the first round, and $v_{ij} = v_i$ in
$MAT_j$. Meanwhile, the value $v_i$ of node i will be
broadcast to the other nodes. There are at most
$\lfloor n/2 \rfloor - 1$ faulty links in the system. In the second round,
node j receives at least $(n-1) - (\lfloor n/2 \rfloor - 1) = \lceil n/2 \rceil$ $v_i$'s
in the i-th row of $MAT_j$. Hence, there are at least
$\lceil n/2 \rceil + 1$ $v_i$'s in the i-th row, and the majority value
of the i-th row should be equal to $v_i$. □

Lemma 3: If the initial value of node i is $v_i$,
then whether or not link ij is perfect, the majority value of
the i-th row of $MAT_j$, $1 \le j \le n$, should be either $v_i$ or ?
with $v_{ij} = \neg v_i$.

Proof: By Lemma 2, when link ij is perfect, the
majority value of the i-th row in node j is $v_i$, for $1 \le j \le n$.
When link ij is faulty, we consider the following two
cases in the first round.

Case 1: $v_{ij} = v_i$

Since there are at most $\lfloor n/2 \rfloor - 1$ faulty links
connected with node j, there are at most $\lfloor n/2 \rfloor - 1$ values
that may be $\neg v_i$'s in the second round. The number of $v_i$'s
is at least $[(n-1) - (\lfloor n/2 \rfloor - 1)] + 1 = \lceil n/2 \rceil + 1$ in the i-th row;
therefore, the majority of the i-th row in $MAT_j$ is $v_i$.

Case 2: $v_{ij} = \neg v_i$

There are at most $\lfloor n/2 \rfloor - 1$ faulty links. In the
second round, the number of $\neg v_i$'s is no more than
$(\lfloor n/2 \rfloor - 1) + 1 = \lfloor n/2 \rfloor$ and the number of $v_i$'s is at least
$(n-1) - (\lfloor n/2 \rfloor - 1) = \lceil n/2 \rceil$. If n is an even number, then
$\lfloor n/2 \rfloor = \lceil n/2 \rceil$, and the majority of the i-th row in $MAT_j$ is
? in the worst case. If n is an odd number, then $\lfloor n/2 \rfloor < \lceil n/2 \rceil$, hence the
majority of the i-th row in $MAT_j$ is $v_i$. □
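
The counting used in Lemmas 2 and 3 can be checked numerically. The sketch below is only an arithmetic illustration; the function name row_counts and its return layout are assumptions made for this example. It evaluates, for a given n, the fault bound, the number of second-round copies guaranteed to arrive undisturbed, and the worst-case counts in the two cases of Lemma 3.

```python
def row_counts(n):
    """Worst-case counts in the i-th row of MAT_j for an n-node network."""
    t = n // 2 - 1              # maximum number of faulty links, floor(n/2) - 1
    intact = (n - 1) - t        # second-round copies of v_i guaranteed undisturbed
    case1_v = intact + 1        # Case 1: the first-round value is also v_i
    case2_not_v = t + 1         # Case 2: faulty copies plus the first-round value
    return t, intact, case1_v, case2_not_v


for n in range(2, 12):
    t, intact, case1_v, case2_not_v = row_counts(n)
    ceil_half = -(-n // 2)      # ceil(n/2) using integer arithmetic
    assert intact == ceil_half
    assert 2 * case1_v > n      # Case 1 always yields a strict majority of v_i
    if n % 2 == 1:
        assert 2 * intact > n   # odd n: v_i still wins in Case 2
    else:
        assert intact == case2_not_v == n // 2   # even n: a tie ("?") is possible
```

For n = 6, for instance, at most two links are faulty, four second-round copies are guaranteed intact in Case 1 terms, and Case 2 can end in a three-to-three tie, which is the situation of rows 5 and 6 at node 2 in the six-node example.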

Lemma 4: If ($\neg\exists\, MAJ_k = \neg v_i$) AND {($\exists\, MAJ_k = ?$)
AND ($v_{ki} = v_i$)} in $MAT_i$, then $DEC_i := \phi$ is
correct.

Proof: If there is a $MAJ_k = ?$, there are
exactly n/2 $v_i$'s and n/2 $\neg v_i$'s in the k-th row. If $v_{ki} = v_i$
in $MAT_i$, then all n/2 $\neg v_i$'s must have been received in the
second round. There are at most $\lfloor n/2 \rfloor - 1$ faulty links in the
system. Therefore, in the second round, node i receives at least
one value originating from node k without disturbance. The
initial value of node k should disagree with the initial
value of node i; hence it is correct to choose $DEC_i = \phi$.

If $v_{ki} = v_i$, we claim that $v_{ki}$ must have been
passed by a faulty link from node k, and the initial
value of node k should be $\neg v_{ki} = \neg v_i$.

To prove this, if link ki is perfect, then the initial
value of node k should be $\neg v_i$. By Lemma 2, the
majority value of the k-th row in $MAT_i$ is $\neg v_i$. This
contradicts the condition ($\neg\exists\, MAJ_k = \neg v_i$).

If the initial value of node k was $\neg v_i$, then, by
Lemma 3, $MAJ_k$ should be either $\neg v_i$ or ? for $v_{ki} = v_i$.
It is a contradiction.
Theorem 1: The FLINK protocol is correct.

Proof: By Lemmas 1, 2, 3 and 4, the theorem is
proved. □

Theorem 2: The FLINK protocol can achieve
consensus.

Proof: (1) Agreement:

Part 1: If a correct node agrees on $\phi$, all
correct nodes should agree on $\phi$.

If the correct node m with initial value $v_i$ agrees
on $\phi$, then by Theorem 1 there is at least one correct node k
with initial value $\neg v_i$ in the network. By Lemma 4, the
majority value in the k-th row of $MAT_j$, $1 \le j \le n$, should
be either $\neg v_i$ or ? for $v_{kj} = v_i$. All correct nodes with
initial value $v_i$ agree on $\phi$. Similarly, for the correct
nodes with initial value $\neg v_i$, the majority value of the
m-th row in $MAT_j$, $1 \le j \le n$, should be either $v_i$ or ? with
$v_{mj} = \neg v_i$. All correct nodes with initial value $\neg v_i$ agree
on $\phi$, too.

Part 2: If a correct node agrees on $v_i$, all
correct nodes should agree on $v_i$.

Suppose the correct node i with initial value $v_i$ has
$DEC_i = v_i$, but there exists some correct node j, $j \ne i$, with
$DEC_j \ne v_i$. This is impossible. To show this, if
$DEC_j = \phi$, then by Part 1, $DEC_i = \phi$. This
contradicts the assumption above.

If $DEC_j = \neg v_i$, then unless the initial value of node j
is $\neg v_i$, this is impossible according to
Definition 2 in Section 3. But if the initial value of node j
is $\neg v_i$, then by Lemma 4, $MAJ_j$ is equal to $\neg v_i$ or ? with $v_{ji}
= v_i$ in $MAT_i$; then $DEC_i = \phi$. This is a contradiction;
hence, all correct nodes should agree on the same value.

(2) Validity:

The initial value of all correct nodes should be
the same. If there is a value $\neg v_i$ in $MAT_j$, $1 \le j \le n$, then
the value must be caused by a faulty link. There are at
most $\lfloor n/2 \rfloor - 1$ faulty links, hence there are at most
$\lfloor n/2 \rfloor - 1$ faulty $\neg v_i$'s in each row. Since the value
received in the first round may be $\neg v_i$, the majority of
each row for all $MAT_j$, $1 \le j \le n$, should be

    $MAJ_k = ?$    if the value received in the first round is $\neg v_i$,
    $MAJ_k = v_i$  otherwise.

So, by step 3 of the FLINK protocol, all correct
nodes should agree on $v_i$. □

Theorem 3: The amount of information
exchange by FLINK is $O(n^2)$.

Proof: In the first round, each node sends out
(n-1) copies of its initial value to the other nodes. In the
second round, an n-element vector is sent to the other
n-1 nodes in the network; therefore, the total number of
message exchanges is $(n-1) + n(n-1)$. This result
implies that the complexity of information exchange is
$O(n^2)$. □
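
The count in the proof of Theorem 3 can be reproduced by enumerating what a single node sends. This is a small arithmetic sketch; the function name values_sent_per_node is introduced here, and the assumption that the count is taken per node follows the wording "each node sends out" in the proof.

```python
def values_sent_per_node(n):
    """Count the values one node sends in the two rounds of FLINK."""
    # Round 1: one initial value to each of the other n-1 nodes.
    round1 = sum(1 for _dst in range(n - 1))
    # Round 2: one n-element vector to each of the other n-1 nodes.
    round2 = sum(n for _dst in range(n - 1))
    return round1, round2


for n in (2, 6, 10):
    r1, r2 = values_sent_per_node(n)
    # Matches (n-1) + n(n-1) from the proof, which simplifies to n^2 - 1.
    assert r1 + r2 == (n - 1) + n * (n - 1) == n * n - 1
```

For the six-node example this gives 5 values in the first round and 30 in the second, 35 in total for one node.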


4. Impossibility 


In this section, some impossibility results for the
consensus problem are presented for the case of all perfect
nodes on an unreliable message communication system.
First we show that reaching consensus
using fewer than two rounds of message exchange is impossible.
Next, we show that when the number of faulty links is greater
than $\lfloor n/2 \rfloor - 1$, it is impossible to obtain consensus.
Based on these results, the following theorems show that the FLINK
protocol is optimal in the sense that it uses the
minimum number of rounds and can tolerate the
maximum number of faulty components.

Theorem 4: Achieving consensus with one round of message
exchange is impossible.

Proof: Part 1: Message exchange is necessary.

Without message exchange, a node cannot know
whether or not a disagreeing value exists in the other
nodes; hence achieving consensus is impossible.

Part 2: One round of message exchange is not enough to
achieve consensus.

Suppose node i is connected with node j by a faulty link
ij. Node i may not learn the initial value of node j by
using only one round of message exchange.

Therefore it is impossible to achieve consensus
by using only one round of message exchange. □

Theorem 5: If the number of faulty links t >
$\lfloor n/2 \rfloor - 1$, achieving consensus is impossible.

Proof: When t > $\lfloor n/2 \rfloor - 1$ and n is an even
number, then since each node has n-1 links in the system, it
is possible that there is a node which has more faulty
links than perfect links. Regardless of the number of
rounds of message exchange, this node will always be
confused by the messages transferred through those
faulty links. The decision made by the node may
conflict with those of other nodes. In this case, achieving
consensus is impossible. □

Theorem 6: Using the minimum number of
rounds, FLINK can tolerate the maximum number of
faulty links in a perfect-node, fully-connected network.

Proof: From Theorem 2, Theorem 4 and
Theorem 5, the theorem is proved. □


5. Conclusion 


Previous works on the consensus problem are
based on the assumption that nodes are the only fallible
components in the network [1,7]; however, in a
generalized case, both the nodes and the links of a
fully-connected network could be faulty. The
behavior of a faulty node can affect the other nodes in a
fully-connected network, while the behavior of a faulty
link will only affect the two adjacent nodes. If the
number of allowable faulty components in the system is
given, then in a generalized case the number of correct nodes
connected by faulty components is less than in a
conventional case; therefore
consensus in a generalized case will be reached
earlier than in a conventional case. For a similar
reason, if the number of required rounds is fixed, the
fault-tolerant capability in a generalized case could be
stronger than that in a conventional case. For the faulty
link case, the FLINK protocol solves the consensus problem
while tolerating $\lfloor n/2 \rfloor - 1$ faulty links and using two rounds of
message exchange. The amount of message exchange is
$O(n^2)$.

In short, the faulty link case or the conventional
faulty node case can be viewed as a special case of the
generalized consensus problem in which both nodes and
links can be faulty. In a generalized case, the FLINK
protocol can still be used to solve the consensus problem
using fewer rounds than in a conventional faulty
node case: in a generalized case, the number of faulty
nodes is less than that of a conventional faulty node
case if the number of faulty components is the same.
ABSTRACT


In the world of cost-effective supercomputing, the 
use of a multistage interconnection network (MIN) as a 
means of connecting many processing elements to many 
memory modules is widespread. Whenever such a system 
is used in a critical environment, reliability becomes an 
important issue. Up to this point reliability evaluation
methods for multiprocessor systems have been ad hoc,
that is, designed for, and applicable to, only one or a few
types of topology.

This paper presents the first automated simulation 
package with the ability to perform the reliability simu- 
lation of MIN-connected systems. The program is auto- 
mated in that the required MIN topology is built by the 
program. The user need only specify the type of MIN 
and other system characteristics. The underlying strat- 
egy of the program is to find the system reliability from 
the system reachability matrix, which is built by a search 
procedure requiring O(N S(N)) time, where S(N) is the 
number of switches in an (N x N) MIN. The package was 
used to simulate the reliabilities of many topologies pro- 
posed in the literature. Some results are presented and 
used for a comparison of the systems. 


1. INTRODUCTION 


Multiprocessor systems using multistage intercon- 
nection networks (MINs) have been an active area of 
research for more than a decade. A plethora of different 
MIN topologies have been proposed to provide commu- 
nications among N processors (PEs) and N memory 
modules (MMs). These MINs are generally designed 
with stages of (n x m) switching elements (SEs), where
n and m are small integers such as 2, 3, or 4. A good
body of literature on MINs can be found in [1], and
a survey of some fault-tolerant multipath MINs is re- 
ported in [2]. 

The novelty of a multiprocessor lies in its ability to 
provide high computing power with assured reliability. 
Reliability becomes important especially when the sys- 
tem is used in a critical application. While performance 
analyses of the MIN-based systems have been carried 
out extensively along with their design, relatively little 
attention has been paid to the reliability issues. Work 
on fault-tolerant MINs has been mostly confined to find- 
ing alternate paths between source and destination sets. 

In the past, research pertaining to the reliability 
evaluation of MINs has addressed either full connectiv- 
ity without degradation [3] or terminal reliability [4]. 
A couple of papers have addressed the complete reliability
of MIN-based systems considering the failure of
PEs, MMs, and SEs [5], [6]. However, these works are
restricted in the sense that either the system size is limited
or the evaluation technique is not applicable to all
types of MINs. Recently, a combinatorial approach for
the reliability evaluation of multiprocessors using (4 x 4)
SEs was given in [7]. This analysis is applicable only to
unique-path MINs with (4 x 4) SEs.

As more and more fault-tolerant MINs are pro- 
posed, it is essential to develop a methodology for char- 
acterizing and comparing one system with another from 
the reliability standpoint. This type of unified evaluation
technique will serve two purposes. First, the reliability
of any existing or new MIN-based system can be
predicted. Second, depending on the implementation
requirements, a cost-effective MIN can be selected. The
survey work in [2] has compared the fault-tolerant properties
of various multipath MINs. However, the work is
not complete in the sense that the usual evaluation criteria,
such as reliability and availability, are not addressed.
In this paper we are concerned with developing
a unified reliability evaluation technique for various
types of MIN-based systems. 


Analytical evaluation of system reliability consider- 
ing the degradation of PEs, MMs, and SEs is very diffi- 
cult due to the NP-hardness of the problem [8]. There- 
fore, all the analytical evaluation techniques have been 
restricted to mostly unique-path MINs. As we are in- 
terested in analyzing and comparing different multipath 
strategies, an analytical approach seems almost impossible.
Hence, simulation is used as the evaluation tool.
This paper presents the first proposed automated simu- 
lation package with the ability to perform the reliability 
evaluation of MIN-connected multiprocessor systems. 

The package takes from the user an input file con- 
taining the specifications of the topology and the de- 
tails of the type of analysis desired. The user is freed 
from the interconnection details because the program 
has the ability to automatically build the proper topology.
A search algorithm is employed during the course
of the simulation to find the connectivity between PEs 
and MMs in the presence of component failures. While 
the size of the system is not limited by the program, the 
host machine environment and simulation time may be 
limiting factors. The reliability model used in this paper 
is known as task-based reliability [9], where a system re- 
mains operational as long as a task can be executed on 
it. Results of (16 x 16) and (64 x 64) systems using the
following topologies are analyzed in the paper with and
without the system cost factor involved.

The topologies considered are (2 x 2) baseline [10], 
an extra-stage baseline, the (4 x 4) butterfly [11], the 
chained MIN [12], [13], the F network [14], the merged
delta network (MDN) [15], the inverse augmented data 
manipulator (IADM) [16], and the interconnection net- 
work designed for reliable architectures (INDRA net- 
work) [4]. Although this selection covers almost the 
whole spectrum of MINs proposed in the literature, the 
program is not limited to only these topologies. It also 
has the ability to include virtually any MIN-based sys- 
tem. We do not fully describe each of these systems; 
more complete system descriptions can be found in the 
literature cited. Since the topologies chosen are repre- 
sentative of those surveyed in [2], this work could be 
considered a follow-up or extension of the work pre- 
sented there. 

In Section 2 we present an overview of the topolo- 
gies considered. The simulation techniques are ex- 
plained in detail in Section 3, including algorithm time 
complexities. Section 4 gives the results of the sys- 
tem simulations and offers a comparison between them. 
Concluding remarks are given in Section 5. 


2. SYSTEMS SURVEY 


This section briefly surveys the different MIN- 
connected multiprocessor systems listed in the intro- 
duction to this paper. We consider a tightly coupled 
multiprocessor environment where the PEs and MMs 
are connected through the MIN. The difference between 
the systems lies solely in the type of interconnection 
network used for communications among the processors 
and memories, and therefore each system will be de- 
scribed by its MIN. 


2.1 Unique-Path MINs 


A unique-path MIN provides only one path be- 
tween each processor and each of the memory units. The 
advantage of unique-path networks lies in the simplicity 
of their implementation. The network uses uncompli- 
cated selector or crossbar switches that require no look- 
ahead capability (i.e., they need not be independently 
cognizant of the conditions of the other components in 
the system). Each switch, if operational, merely routes 
an input to the proper output link depending upon the 
value of the request tag bit corresponding to the switch’s 
stage. This simplicity results in fewer internal compo- 
nents and consequently a high average switch reliabil- 
ity relative to more complicated switches, such as those 
used in the multiple-path systems described later. 

The disadvantage of unique-path networks lies in 
the fact that if any of the switches along a desired path 
fails, the entire path is eliminated, and the requesting 
module is unable to access the requested module. How- 
ever, if a strategy exists whereby another path can be 
found to act as a detour around the failed element, this 
fault may be tolerated. These extra paths can be pro- 
vided at the expense either of redundant passes through 
the network [17], [18] or of additional hardware; we will 
consider the latter.

The baseline MIN is one example of a unique-path
network. The topology of an (8 x 8) baseline MIN
is shown in Fig. 1. The baseline network consists of
n = log_2 N switch stages, each containing N/2 (2 x 2)
crossbar switches. The baseline MIN was chosen to rep- 
resent other unique-path MINs proven to be topologi- 
cally equivalent to the baseline by Wu and Feng [10]. 
We use the baseline in our explanations because of its 
simplicity of representation. 

Another unique-path network, composed of (4 x 4)
switches, is the butterfly MIN. The BBN Butterfly Parallel
Processor™ is a commercially available system
with up to 256 processors [11]. Because of the additional
links in each switch, the butterfly MIN has
only log_4 N stages, each consisting of only N/4 switches.
Thus, the communications delay of a butterfly MIN is
only O(log_4 N), whereas the baseline delay is O(log_2 N).

Fig. 1. An (8 x 8) baseline system.


2.2 Path Redundancy for Unique-path MINs 


It may be possible to provide redundant paths to 
a unique-path MIN by at least four methods. The first
involves the addition of an extra stage of switches to the 
input of the MIN. This is done by duplicating the MIN 
input stage along with its output link interconnection 
pattern; this extra stage is inserted between the proces- 
sors and the previous input stage. This method is ex- 
amined in this research by the addition of an extra stage 
to both the baseline and the butterfly MINs. A second 
redundancy method adds chaining links to each switch, 
partitions all the switches, and then chains together the 
switches in the same partition. The chained baseline 
MIN was examined as an example of this method. A 
third redundancy strategy consists of replicating r times 
a network consisting of (r x r) switches (e.g., the base- 
line MIN consists of (2 x 2) crossbar switches, there- 
fore 2 network copies would be provided). This INDRA 
technique is examined for the baseline network. Finally, 
the fourth technique crosslinks c copies of an (% x %) 
unique path network. This technique can be applied to 
any of the topologically equivalent unique-path MINs,
but traditionally, it is employed using the delta topology
to form the MDN.


2.3 Inherently Path-redundant MINs 


Anticipating the need for redundant paths in a 
MIN, some topologies have been designed to provide 
these multiple paths without a need for further modification.
One example, the IADM, uses (log_2 N) + 1
switch stages, each consisting of N (3 x 3) switches.
Another example, the F network, uses log_2 N switch
stages, each with N (4 x 4) switches. These two networks
show most clearly the truism that statically redundant
paths require redundant hardware.


3. SIMULATION TECHNIQUES 


In this research, it is assumed that the MIN of each 
of the systems considered has an input side and an out- 
put side, and all communications between modules are 
carried out in one pass from an input position to an 
output position. Using the system of Fig. 1 as an example,
communications between processor 0 (node 20)
and processor 3 (node 23) would follow the node path
20 → 16 → 12 → 9 → 3 and not 20 → 16 → 12 →
17 → 23. This example also illustrates the equivalence
of input and output positions.

Depending upon the implementation, each of the 
topologies can be operated under either a circuit- 
switched or a packet-switched communications proto- 
col. Under circuit switching, a physical link is estab- 
lished between two modules, and is used for transmis- 
sions in both directions. In contrast, packet switching 
is an asynchronous simplex protocol where information 
packets are exchanged via the network. The program 
has the ability to simulate either of these protocols. 

To use the program, the user need merely edit an 
already-existing input file. The information in the input 
file includes the following. 

a. The system size.
b. The system type (from a menu list).
c. The communications protocol.
d. The failure rates of PEs, MMs, and SEs.
e. Whether each PE is assigned a local MM.
f. The number of copies of the MIN (for the INDRA
   case).
g. The output file name.

At the beginning of the simulation, this informa- 
tion is read into the program. The program then calls 
the appropriate procedure to build the desired topol- 
ogy. The ability to build the topology automatically 
(described later) is especially important in the case of 
large systems, when hand entry of the interconnection 
pattern becomes very difficult. 

The simulator determines the system condition,
whether operational (up) or failed (down), from R, the
(N x N) system reachability matrix. R describes the
connectivity between modules in the following way: if
at least one path exists between processor p and memory
m, then the matrix element R[p,m] = 1; otherwise
R[p,m] = 0. In an unfailed system (at system startup),
every element of R is 1. As components fail, the degree
of system degradation can be determined from R. If a
system is defined as being up when it has at least i
processors and j memories, all operational and completely
connected to each other, then the system condition can
be determined by examining R to ascertain whether a
submatrix of order at least (i x j) with all elements equal
to 1 can be found in R.
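
The up/down test described above can be sketched as a brute-force search for an all-ones submatrix. This is only an illustration, not the procedure of [5]: the function name system_is_up and the list-of-lists layout of R are assumptions made here, and the search is exponential in the number of processors, so it suits small examples only.

```python
from itertools import combinations

def system_is_up(R, i, j):
    """Return True if R contains an all-ones submatrix of order at least (i x j).

    R is a list of rows (processors) over columns (memories); i >= 1 is assumed.
    Checking subsets of exactly i rows is enough, because any larger
    all-ones submatrix contains an i-row one.
    """
    for rows in combinations(range(len(R)), i):
        # Count the memories reachable from every processor in this subset.
        common = sum(1 for column in zip(*(R[r] for r in rows)) if all(column))
        if common >= j:
            return True
    return False

# Hypothetical degraded (3 x 3) reachability matrix.
R_example = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]
```

With R_example, a task needing 2 processors and 2 memories can still run, while one needing 3 processors and 2 memories cannot.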

The simulator is based upon the following concept. 
The reliability evaluation of any system is dependent 
upon its reachability matrix. Therefore, if a program 
can be developed to find system reliability from R, and
if any MIN-based system can be reduced to its reachability
matrix, then the reliability of any MIN-based
system can be simulated.


The algorithm for simulating system reliability 
from R is as described in [5] and will not be detailed 
here. The remainder of this section will explain the 
program with respect to internal system representation 
and the characterization of a search-traversal algorithm 
capable of finding R. The serial version of the search 
algorithm is given, and possibilities for a parallel imple- 
mentation are explained. 


3.1 System Representation 


The topology of each network is represented by certain
constants and arrays as follows (all parenthesized
examples correspond to the system of Fig. 1). The system
constant a is the total number of PEs, MMs, and
SEs present in the system (e.g., a = 2N + (N/2) log_2 N =
28); the system constant offset (= a − N) is the vertex
number of processor 0 (e.g., offset = 20); finally, the
system constant b is the maximum number of output
links per node present in the system (e.g., b = 2). The
vertices are numbered from 0 to a − 1 in column-major
order beginning with the vertex corresponding to memory
0 and ending with that of processor N − 1. The output
links (if any) are numbered from 0 to b − 1 from top
to bottom. The (a x b) matrix, T, describes the interconnection
pattern of each topology as follows: the element
T[i, j] is the component connected to output link j of
component i, where i ∈ {0..a − 1} and j ∈ {0..b − 1}
(e.g., T[13,0] = 8). By convention, if T[i, j] = −1,
then output link j does not exist for component i (e.g.,
T[3,1] = −1). The one-dimensional boolean array living
represents the system component failure condition
as follows: for all i = 0..a − 1, if living[i] then component
i is operational, else i is failed.

Since all the systems can be represented by this 
scheme, the user need only indicate the network topol- 
ogy and size. The program calculates the system con- 
stants and calls the appropriate procedure to build T. 
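
As an illustration of this representation, the following sketch builds the constants and the matrix T for a baseline MIN. The wiring rule (an inverse shuffle between stages, applied recursively to each half) and the name build_baseline are assumptions made here, not taken from the package; the rule was chosen because it reproduces every example quoted in the text for Fig. 1.

```python
def build_baseline(N):
    """Return (a, offset, b, T) for an (N x N) baseline MIN of (2 x 2) switches.

    N is assumed to be a power of two. Vertices are numbered memories first,
    then the switch stages from the memory side, then the processors.
    """
    n = N.bit_length() - 1            # n = log2(N)
    half = N // 2                     # switches per stage
    a = 2 * N + half * n              # PEs + MMs + SEs
    offset = a - N                    # vertex number of processor 0
    b = 2                             # maximum output links per vertex
    T = [[-1] * b for _ in range(a)]

    def stage_base(s):
        # First vertex of stage s, where s = 0 is the processor-side stage.
        return N + (n - 1 - s) * half

    for p in range(N):                # each processor feeds one input-stage switch
        T[offset + p][0] = stage_base(0) + p // 2

    for s in range(n):
        block = N >> (s + 1)          # switches per sub-block at stage s
        for g in range(half):         # g = switch index within the stage
            first = (g // block) * block
            k = g - first
            # Upper output goes to the top half of the block, lower to the bottom.
            lines = (2 * first + k, 2 * first + block + k)
            for link, line in enumerate(lines):
                if s == n - 1:
                    T[stage_base(s) + g][link] = line    # a memory vertex
                else:
                    T[stage_base(s) + g][link] = stage_base(s + 1) + line // 2
    return a, offset, b, T
```

For N = 8 this yields a = 28 and offset = 20, with T[13][0] = 8 and no output links for memory 3, matching the parenthesized examples above.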

These representations are needed because the pro- 
cedure which finds the reachability matrix of any sys- 
tem is a search-traversal algorithm. A search-traversal 
strategy is necessary since the conventional methods of 
finding R reported in [5] do not work in the case of 
multipath systems such as those surveyed in [2]. 


3.2 A Serial Algorithm for Finding R 


As described in [19], a MIN-based multiprocessor
system can be represented by a directed graph. It follows
that an (N x N) system can also be
conceptualized as a grove of N search trees, where the
root of each tree corresponds to an individual processor
vertex, its leaves represent the memory units, and its
shape depends upon the network topology. The search
tree for the unfailed system of Fig. 1 can be seen in
Fig. 2(a). The effect of a component failure will be to
prune the tree at the appropriate position(s) held by
that component in the tree. For example, if the switch
corresponding to node 12 of Fig. 1 fails, the resulting
search tree is as seen in Fig. 2(b).

The problem, then, of finding the connectivity between
any processor p and any memory m at a given
time reduces to: is m present in the search tree of
p? This can be accomplished by initializing R to 0,
and then performing a reverse preorder traversal of the
search tree of p, setting to 1 the appropriate elements
of R each time a leaf vertex (memory) is reached. The
termination of the traversal search of the tree of processor
p results in the completion of row p of R. Therefore,
N such searches will complete the entire reachability
matrix. The Search algorithm of Fig. 3 performs the
search of one tree, and the Reach algorithm of Fig. 4
uses Search to build the reachability matrix, R.

Fig. 2. Search tree for processor 0 of Fig. 1. (a) For
an unfailed system. (b) Pruned after failure of
node 12.


The stack of Search is initially empty. At the be- 
ginning of the search, the source vertex is pushed onto 
the stack. The following steps are repeated while the 
stack is not empty. 

1. A node is popped from the stack and checked 
if it is a leaf; if so, then the proper element of 
R is set to 1. 

2. The processed node is tagged “visited.” 

3. The unvisited neighbors of node are each 
pushed onto the stack in birthright order. 

Birthright order is from leftmost child to rightmost 
child in the tree representation. For example, birthright 
order for vertex 16 of Fig. 1 is vertex 14 and then vertex 
12. This order ensures the depth-first traversal desired. 
The marking of the processed vertices as “visited” serves 
two purposes. 

1. The algorithm is kept from infinitely traveling 
around any closed loops possibly inherent in a 
topology (such as the very visible loops of the 
chained baseline MIN structure). 

2. A mechanism is provided by which future consideration
of a vertex once processed is denied.
This helps to reduce search time by ensuring
that each vertex is processed only once,
thereby providing an extra stage of pruning in
systems whose search trees have vertices holding
multiple positions. It is important to note
that this pruning mechanism makes this algorithm
suitable only for finding whether at least
one path exists to each memory, not for finding
the number of paths to each memory.


procedure Search (source : integer);
begin
    Mark each vertex “not visited”;
    Zero row source − offset of R;
    Reset the local stack;
    Push source onto the stack;
    while stack is not empty do begin
        node := stack pop;
        if node < N then begin
            (* node is a memory *)
            R[source − offset, node] := 1
        end; (* if *)
        Mark node “visited”;
        for each of node’s children i do
        begin
            if (i is not visited)
               and ((i < N)
               or (i is living)) then begin
                Push i onto stack;
            end; (* if *)
        end; (* for *)
    end; (* while *)
end; (* procedure *)


Fig. 3. Tree-search algorithm. 
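
A runnable counterpart of Fig. 3 might look like the sketch below. The function signature, the list-based stack, and the small hand-written (4 x 4) baseline table T4 are assumptions made for illustration. One deliberate deviation from the figure: a vertex that was pushed twice before its first pop is skipped on the second pop, so that each vertex really is processed once.

```python
def search(source, T, living, N, offset, R):
    """Fill row (source - offset) of R by a stack-based traversal from `source`."""
    row = source - offset
    visited = [False] * len(T)        # mark each vertex "not visited"
    R[row] = [0] * N                  # zero the row being rebuilt
    stack = [source]
    while stack:
        node = stack.pop()
        if visited[node]:
            continue                  # guard not in Fig. 3: already processed
        if node < N:
            R[row][node] = 1          # node is a memory (leaf)
        visited[node] = True
        for child in T[node]:
            if child == -1:
                continue              # this output link does not exist
            # Memories are always pushed; switches only if operational.
            if not visited[child] and (child < N or living[child]):
                stack.append(child)

# Hypothetical (4 x 4) baseline: memories 0-3, switches 4-7, processors 8-11.
N4, OFFSET4 = 4, 8
T4 = [[-1, -1] for _ in range(4)] + [
    [0, 1], [2, 3],                      # memory-side switches 4 and 5
    [4, 5], [4, 5],                      # processor-side switches 6 and 7
    [6, -1], [6, -1], [7, -1], [7, -1],  # processors 8 to 11
]
```

Failing switch 4 in this small system removes memories 0 and 1 from every search tree, which is the pruning effect described for Fig. 2(b).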


procedure Reach;
    procedure Search;
begin
L1: for p := 0 to N−1 do begin
        source := p + offset;
        Search (source);
    end; (* for *)
    if packet-switched protocol then
    begin (* packet adjust *)
L2:     for p := 0 to N−1 do begin
            for m := p+1 to N−1 do
            begin
                if R[p,m] = 0
                   or R[m,p] = 0 then
                begin
                    R[p,m] := 0;
                    R[m,p] := 0
                end; (* if *)
            end; (* for *)
        end; (* for *)
    end; (* packet adjust *)
L3: for each processor p do begin
        if p is failed then
            zero row p of R;
    end; (* for *)
L4: for each memory m do begin
        if m is failed then
            zero column m of R;
    end; (* for *)
end; (* procedure *)


Fig. 4. Algorithm for finding R. 
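
The four loops of Fig. 4 can be sketched in the same style. The body of the packet-adjust test (zeroing both symmetric elements) is an assumption inferred from the description of loop L2 in Sections 3.3.2 and 3.4; the helper _search_row, the function reach, and the (4 x 4) table T4 are likewise illustrative names.

```python
def _search_row(source, T, living, N):
    """Return the list of memories reachable from processor vertex `source`."""
    row, visited, stack = [0] * N, [False] * len(T), [source]
    while stack:
        node = stack.pop()
        if visited[node]:
            continue
        visited[node] = True
        if node < N:
            row[node] = 1
        for child in T[node]:
            if child != -1 and not visited[child] and (child < N or living[child]):
                stack.append(child)
    return row

def reach(T, living, N, offset, packet_switched=False):
    """Build the (N x N) reachability matrix R following loops L1 to L4."""
    R = [_search_row(p + offset, T, living, N) for p in range(N)]      # L1
    if packet_switched:                                                # L2
        for p in range(N):
            for m in range(p + 1, N):
                if R[p][m] == 0 or R[m][p] == 0:
                    R[p][m] = R[m][p] = 0      # assumed adjustment
    for p in range(N):                                                 # L3
        if not living[offset + p]:
            R[p] = [0] * N
    for m in range(N):                                                 # L4
        if not living[m]:
            for p in range(N):
                R[p][m] = 0
    return R

# Hypothetical (4 x 4) baseline: memories 0-3, switches 4-7, processors 8-11.
T4 = [[-1, -1] for _ in range(4)] + [
    [0, 1], [2, 3], [4, 5], [4, 5],
    [6, -1], [6, -1], [7, -1], [7, -1],
]
```

Note that Search pushes memories regardless of their state, so failed memories and failed processors are accounted for only by the column and row zeroing of L3 and L4.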


3.3 Algorithm Time Complexities 


It is important that the time complexities of Search
and Reach be calculated since they are used frequently
in the simulation program. In the calculations that follow,
the function S(N) gives the number of switches in
the MIN as a function of N (e.g., S(N) = (N/2) log_2 N
for an (N x N) baseline MIN). The relative growths of
N and S(N) are topology dependent. In the case of
the F network, for example, S(N) = N log_2 N > N
always. However, in the separate example of the butterfly
network, S(N) = (N/4) log_4 N, which is greater than
N only for large N (i.e., N > 256). Since the asymptotic
time complexities concern systems with large N,
it is assumed in the derivations below that in all cases
S(N) > N.
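
The switch counts quoted above are easy to tabulate. The sketch below uses the formulas given in this paper for the baseline, butterfly, F network, and IADM; the function names and topology labels are illustrative only.

```python
def exact_log(N, base):
    """Return k such that base**k == N; raise ValueError if N is not a power of base."""
    k, value = 0, 1
    while value < N:
        value *= base
        k += 1
    if value != N:
        raise ValueError(f"{N} is not a power of {base}")
    return k

def switch_count(topology, N):
    """S(N) for the topologies whose switch counts are stated in the text."""
    if topology == "baseline":
        return (N // 2) * exact_log(N, 2)
    if topology == "butterfly":
        return (N // 4) * exact_log(N, 4)
    if topology == "f_network":
        return N * exact_log(N, 2)
    if topology == "iadm":
        return N * (exact_log(N, 2) + 1)
    raise ValueError(f"unknown topology: {topology}")
```

For the butterfly, S(256) equals 256 exactly, and the next admissible size, N = 1024, is the first for which S(N) exceeds N.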


3.3.1 Time Complexity of the Search Procedure 


The Search procedure of Fig. 3 consists of two 
parts: the nonsearch statements (those before the while 
loop) and the statements of the search loop (those mak- 
ing up the while loop). The time complexity of the 
nonsearch statements is O(S(N)) because the domi- 
nating term is the Mark statement since it initializes 
all 2N + S(N) vertices, and S(N) grows faster than 
N. The search statements consist of: three assignment 
statements, each with constant time complexity, and a 
for loop of O(b) = O(1) time complexity since the maximum
number of links per node, b, remains constant
as the size of the system grows. Thus the statements
within the while loop are all of constant time complex- 
ity, and the time complexity of the entire loop is gov- 
erned by the maximum number of iterations of the loop 
as follows. The marking as visited of each processed 
node ensures that each vertex is considered only once 
during each tree traversal. Since the maximum number 
of vertices that a single search can consider is the S(N) 
switches plus the N memories plus the processor root, 
the time complexity of a search is O(S(N)). Therefore,
the time complexity of procedure Search is

    T_S(N) = O(S(N)).                    (1)


3.3.2 Time Complexity of the Reach Procedure 


Procedure Reach can be seen to consist of four 
loops labeled L1 through L4 in Fig. 4. Loop L1 per- 
forms N successive calls on Search, and therefore has 
an O(NS(N)) time complexity. Loops L3 and L4 each 
perform N iterations of constant-time conditional assignment
statements, so each has an O(N) time complexity.
Loop L2 performs constant-time conditional
assignments as many times as there are elements of R
above the main diagonal, or (N^2 − N)/2. Since L1 is the
dominant loop, the time complexity of Reach is

    T_R(N) = O(N S(N)).                  (2)


3.4 A Parallel Algorithm for Finding R 


Time requirements for the searches can be lessened 
even further by performing them in parallel. The paral- 
lelism of the Search algorithm is apparent; each search is 
completely independent of every other search (i.e., there 
are no data dependencies between the searches). This 
parallelism can be easily exploited on an array of P processors,
where P = 2^n for positive integer n. If N > P,
the technique of loop concurrentization [20] could be 
used without much difficulty by partitioning the rows 
of R (searches) and assigning a different partition to 
each processor. For ease of explanation, however, we 
will consider the case where P = N, where each of the 
processors would be assigned a different row of R. 

When a packet-switched protocol is being simu- 
lated, the execution of the packet-adjustment state- 
ments in loop L2 of Fig. 4 introduces data dependencies, 
and communications between processors becomes nec- 
essary. Specifically, each processor adjusts its row of R, 
but only the elements to the right of the main diagonal. 
Each of these elements R[i, j], where i < j, is compared
to the element R[j, i] symmetrical to it with respect to
the main diagonal. But before a processor i can properly
examine an R[j, i] value, it must receive a signal
from processor j that the search of row j is complete.

The blocks labeled L1 through L4 of Fig. 4 can each 
be reduced by a factor of N if the procedure is imple- 
mented in parallel. In this case, the dominant execution 
sequence would be the Search procedure of L1. There- 
fore, if procedure Reach is performed in parallel with 
P = N, the time complexity is O(S(N)) by Eq. 1. 
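
The independence of the N searches can be illustrated with a worker pool, one task per row of R. This is only a sketch of loop L1: the use of Python's ThreadPoolExecutor, the names search_row and reach_parallel, and the (4 x 4) table T4 are assumptions made here, and Python threads serve merely to show the decomposition rather than to deliver real speedup. The packet adjustment of L2 would have to wait until all rows are complete, as noted above.

```python
from concurrent.futures import ThreadPoolExecutor

def search_row(p, T, living, N, offset):
    """Return row p of R; reads shared data only, so rows are independent."""
    row, visited, stack = [0] * N, [False] * len(T), [p + offset]
    while stack:
        node = stack.pop()
        if visited[node]:
            continue
        visited[node] = True
        if node < N:
            row[node] = 1
        for child in T[node]:
            if child != -1 and not visited[child] and (child < N or living[child]):
                stack.append(child)
    return row

def reach_parallel(T, living, N, offset, workers=None):
    """Run the N searches of loop L1 concurrently and collect the rows in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = pool.map(lambda p: search_row(p, T, living, N, offset), range(N))
        return list(rows)

# Hypothetical (4 x 4) baseline: memories 0-3, switches 4-7, processors 8-11.
T4 = [[-1, -1] for _ in range(4)] + [
    [0, 1], [2, 3], [4, 5], [4, 5],
    [6, -1], [6, -1], [7, -1], [7, -1],
]
```

Because each task writes only its own row, no locking is needed for L1; with fewer workers than rows, the pool distributes the rows among the workers, which corresponds to the N > P case.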


4. RESULTS AND DISCUSSION 


This section compares the selected topologies with
respect to their simulated reliabilities. In addition, since
the design emphasis on MINs is inspired by a need
for cost-effective communication networks, the systems
were compared with respect to the ratio of system reliability
to system cost (or reliability-to-cost ratio, RCR).
The program also has the ability to simulate a system
with any given coverage factor, C [21]. Results were obtained
for systems of different sizes under both circuit-
and packet-switched protocols and with different values
of C. However, only selected outputs are presented here.


4.1 Elements of the Comparison 


For the comparisons to be valid, the processors as 
well as the memories were assumed to be homogeneous 
within each system, and the same processor and mem- 
ory failure statistics were used in each type of system. 
In this way, each network differed from the others only 
in its type of MIN. 

The reliability of each system is directly related to 
the mean failure rates of the individual elements making
up the interconnection network. The mean failure
rates for processors, λ_p, and memories, λ_m, were each
taken to be one per 10^4 hours. Since the systems differ
only in the type of MIN used, any change in λ_p
or λ_m will affect all of the systems equally. Therefore,
the comparison depends upon the failure rate of the
MIN, which depends upon the failure rates of the individual
switching elements. In the absence of any practical
failure data, switch failure rates were calculated
using the MIL-HDBK-217B reliability model for
metal-oxide-semiconductor integrated circuits [22]. Details of
the assumed switch design and failure-statistic calculations
can be found in [23]. The program considers the
MIN to consist of three sets of switches: the input bank,
the output bank, and the banks between the input and
the output banks. Then the characteristic switch failure
rate is assigned to the switches of each set.

The network cost factor is the sum of the costs 
of all the individual switches comprising the network. 
The cost of each switch is calculated as a function of 
the number of its input links, n, and the number of its 
output links, m, using the equation of the cost function, 
C(n,m), of Eq. 3. 

    C(n,m) = \begin{cases} nm, & \text{for a crossbar switch;} \\ n + m, & \text{for a selector switch.} \end{cases}    (3)


Eq. 4 calculates C_N, the cost of a MIN consisting of
x different types of switch, each type i having a cost
C_i and a population N_i. The network cost factors are
then calculated by dividing each network cost by the
minimum cost of all the networks of the same size (in
this case the baseline).


    C_N = \sum_{i=1}^{x} N_i C_i.    (4)
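
The cost calculation can be sketched as follows. The sum-of-switch-costs form follows the description of the network cost given above; the function names, the string labels for switch kinds, and the second network in the usage note are hypothetical.

```python
def switch_cost(n, m, kind):
    """Cost C(n, m) of one switch with n input links and m output links (Eq. 3)."""
    if kind == "crossbar":
        return n * m
    if kind == "selector":
        return n + m
    raise ValueError(f"unknown switch kind: {kind}")

def network_cost(switch_types):
    """Cost of a MIN given (population N_i, cost C_i) pairs, one per switch type."""
    return sum(population * cost for population, cost in switch_types)

def cost_factors(costs):
    """Normalize each network cost by the cheapest network of the same size."""
    cheapest = min(costs.values())
    return {name: cost / cheapest for name, cost in costs.items()}

# A (16 x 16) baseline has (16/2) * log2(16) = 32 crossbar switches of size (2 x 2).
baseline_16 = network_cost([(32, switch_cost(2, 2, "crossbar"))])
```

Here baseline_16 evaluates to 128; a hypothetical competitor costing 256 would then receive a cost factor of 2.0 from cost_factors.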


4.2 System Reliability Comparison 


The reliability curve for a task requiring 50% of 
the total number of PEs and 50% of the total number 
of MMs was obtained for each of the systems. 

Figs. 5 and 6 contain the (16 x 16) system curves 
for a circuit-switched protocol and a coverage factor of 
C = 1.0 and C = 0.8, respectively. The difference between
the curves of Fig. 5 is noticeable; however, when
the system’s ability to reconfigure itself is relatively
weak, as in Fig. 6, the fault-tolerant scheme has less
of an effect on reliability. In fact, with a coverage factor
of 0.8, the topology of the MIN seems to have almost 
no effect at all on reliability; the curves are almost in- 
distinguishable from each other. This observation fol- 
lows intuition: a topologically inherent fault-tolerance 
scheme is of benefit only if the maintenance processor 
is able to utilize it. 

The reliability curves for (64 x 64) systems under 
a circuit-switched protocol with C = 1.0 are shown in 
Fig. 7. As expected, the INDRA, F network, chained 
MIN, MDN, and extra-stage MINs give high reliability 
compared to the unique-path MINs. Also, as the system 
size increases, the reliability gain of multipath networks
becomes more pronounced. 

Probably the most surprising curve, however, is 
that of the IADM system. It does not seem to agree 
with intuition that it would have a reliability consis- 
tently below all the others, including the unique-path 
systems. However, upon further inspection, the rea- 
sons become clear. The IADM has multiple paths, but 
they are not evenly distributed between all processor-memory
pairs. For example, there is only one path between
processor i and memory j when i = j. However,
probably the most significant reason for the low reliability
is that the IADM contains N(log_2 N + 1) switches,
many more than the (N/2) log_2 N switches of the baseline
or the (N/4) log_4 N switches in the butterfly. If the failure
rate of an individual switch is λ_s, then the failure rate
of the switches in the system is given by σλ_s, where σ
is the number of active switches at any time. Clearly,
if σ is very large (as in the IADM), then the switches
will fail much more frequently than if σ is small (as in
the baseline, or especially the butterfly). Therefore, the
combination of unevenly distributed paths and quick-failing
switches makes the IADM a less reliable system
compared to the others.
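
As a worked illustration for N = 64, write σ for the number of active switches and λ_s for the per-switch failure rate. The comparison below assumes, purely for the sake of arithmetic, that every switch is active and that all three topologies share the same λ_s; the switch failure rates actually used in the simulations depend on the switch design and need not be equal.

```latex
\begin{align*}
\sigma_{\mathrm{IADM}}      &= N(\log_2 N + 1)       = 64 \cdot 7 = 448, \\
\sigma_{\mathrm{baseline}}  &= \tfrac{N}{2}\log_2 N  = 32 \cdot 6 = 192, \\
\sigma_{\mathrm{butterfly}} &= \tfrac{N}{4}\log_4 N  = 16 \cdot 3 = 48,  \\
\frac{\sigma_{\mathrm{IADM}}\,\lambda_s}{\sigma_{\mathrm{baseline}}\,\lambda_s}
  &= \frac{448}{192} \approx 2.3, \qquad
\frac{\sigma_{\mathrm{IADM}}\,\lambda_s}{\sigma_{\mathrm{butterfly}}\,\lambda_s}
   = \frac{448}{48} \approx 9.3 .
\end{align*}
```

Under these assumptions, switch failures in the IADM would occur a little more than twice as often as in the baseline and about nine times as often as in the butterfly.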


4.3 System RCR Comparison


When the cost of a system is to be considered along 
with its reliability, a useful measure is the reliability-to- 
cost ratio (RCR), i.e., the reliability is divided by the 
system cost factor. This ratio serves as a comparison 
between the networks surveyed in Section 2. We ob- 
served that for smaller (16 x 16) systems, the RCR puts 
the unique-path butterfly and baseline MIN systems at 
the top of the ranking. The increased reliability of the 
more complicated multiple-path MIN systems does not 
compensate for the extra system cost. However, as the 
size of the system grows, the curves for the unique-path 
systems fall below some of those with multiple-paths, 
as seen for a (64 x 64) system in Fig. 8. In these larger 
systems, the extra cost begins to be of some benefit, 
especially the addition of an extra stage to the baseline 
which takes the number one spot. When cost is con- 
sidered, the F network falls from the upper positions of 
the R(t) curves to occupy the lower two positions along 
with the IADM system in the RCR curves. 


4.4 Summary of System Comparisons 


From the examination of the curves of Figs. 5
through 8, the following system evaluation is offered.
These observations are based upon the particular switch
failure rates calculated as described above.

The best overall reliability is offered by the extra- 
stage baseline MIN. The reliability of this topology 
ranked in the top four. Its value is most clearly seen, 
however, in the RCR comparison of (64 x 64) systems, 
where it ranks in the number one spot. This indicates 
that for large systems where cost is a consideration, the 
best fault tolerance technique is the addition of an extra 
stage on a baseline (or topologically equivalent) system. 
The IADM system was the least reliable of the systems 
compared due to the reasons mentioned earlier.

Although the simulation of large systems takes a lot
of computer time, a comparison of Figs. 5 and 7 shows
that the MIN topology has a greater effect on reliability
as the size of the system grows. This
indicates rather strongly that the algorithms described
in this report should be implemented on a large-grain
parallel processor.


5. CONCLUSIONS


This paper reports the first automated package 
with the ability to simulate the system reliability of 
virtually any MIN-based multiprocessor system. The 
program accepts the type of MIN and various failure 
rates from the user. It builds the MIN automatically 
and stores the interconnection pattern in a matrix. A 
traversal-search algorithm is used to find the reacha- 
bility matrix of the system with random faults. The 
reachability matrix in turn is used in the calculation 
of the system reliability. The unified approach of this 
package makes possible system reliability predictions as 
well as system comparisons with respect to reliability 
issues. 

The package in its present form provides the frame- 
work around which many other features can be built. 
For example, one important extension could be the abil- 
ity to measure system performance in the presence of 
faults. Addition of this performance predictor to the 
reliability model will give the program the capability to 
predict performance-related reliability measures. An- 
other addition could be the capability to predict the 
coverage factor of the system from a model of the in- 
dividual processors and the maintenance processor. In 
this way, the coverage factor, shown in this research to 
be so important to system reliability, could be calcu- 
lated rather than estimated and provided by the user. 
Abstract -- An adaptive checkpointing method and a 
companion rollback method, both best suited to 
tightly-coupled multiprocessor systems, are proposed in this 
paper. An interprocess communication protocol is employed to 
synchronize checkpointing. Based on the checkpointing 
method, the companion rollback method restores the system to 
a recoverable state. The checkpointing and recovery overheads 
of our method are compared with those of an existing unplanned 
method and of an optimistic unplanned method, which serves 
mainly as a performance index. Two 
comparison results are presented. The first shows the 
performance of the three methods at the same parameter 
values. The second illustrates the optimum performance 
of each method. The performance evaluation reveals 
that our method is better than the existing unplanned method in 
most cases and sometimes better than the optimistic method. 
Performance breakpoints of the three methods are also 
depicted to investigate the constraints on the individual methods. 


1. Introduction 


Rollback recovery methods [2]-[12] have been 
proposed to cope with reliability, availability and performance 
issues of computing systems. In a system with a rollback 
recovery mechanism, a user or system program can be 
decomposed into recovery blocks. A recovery block [5] is a 
program structure consisting of a checkpoint, a primary process, 
alternate processes and acceptance tests for both the primary and 
alternate processes. Rollback recovery methods can be 
categorized into two groups based on checkpointing policy. 
The first is the unplanned method, which imposes no 
constraints on processes regarding scheduling or 
communication among processes in order to establish checkpoints. The 
second, termed the planned or global method, does 
impose some such constraints [6]. Unplanned 
recovery has the advantage of freeing processes for useful 
computation during the checkpointing period, yet it suffers from the 
domino effect [5] because of its lack of coordination among 
processes. Planned recovery has the opposite properties. 

An adaptive method is proposed here to improve 
rollback recovery performance. Our method, a mixture of the 
planned and unplanned methods, is best suited to 
tightly-coupled systems with centralized control because of the low 
time overhead of checkpointing and recovery in such systems. 
The performance comparison presented below shows that our 
method is usually better than the FTMR2M method [2]-[3] and 
sometimes better than the optimistic method [3]. 

Section 2 describes the adaptive checkpointing and 
rollback methods and contrasts them with the two 
unplanned methods used for comparison. 
Sections 3 and 4 present the analyses and comparisons for 
shared-memory and message-passing systems, respectively. 
The last section outlines future research areas and states our 
conclusions. 


2. Adaptive Rollback Recovery Method 


The concept of rollback recovery is illustrated in Fig. 1. 
Tp is the intercheckpointing interval. Trb is the time taken to 
establish a checkpoint. A set of checkpoints is consistent, and 
thus constitutes a recoverable line, if each process Pi, after 
having established its checkpoint, communicates only with 
other processes in the same subset that have also established 
their checkpoints [4]. 


2.1. Adaptive Checkpointing Method 

The interprocess communication protocol we propose to 
enforce synchronization among processes is based on the 
following discussion. Fig. 2 illustrates a few cases of 
interprocess communication during the checkpointing period that must be 
resolved to avoid the domino effect. 1j and 2j are the instants at which 
processes 1 and 2, respectively, recognize the checkpointing 
signal for recovery block j. They establish checkpoints RP1(j) 
and RP2(j) at a later time. m(n,j,d) is the nth request message 
to the requested process d during recovery block j of the 
requesting process, and a(n,j,s) is the acknowledgement of the 
requested process to the nth request message from requesting 
process s in s's recovery block j. All messages initiated 
before RP2(j) and after RP1(j) must be rejected or 
delayed to avoid the domino effect. For example, if m(n+1,i,1) is 
issued by process 2 in its recovery block i and recognized by 
process 1 in its recovery block j, this request must be rejected 
by process 1 and reissued by process 2 after checkpoint 
RP2(j). m(n+1,i,1) in recovery block i thus becomes m(0,j,1) 
in recovery block j. m(0,j,2), in contrast, will be received as a 
tentative message and processed after RP2(j). A tentative 
message in the current recovery block will be committed as a 
permanent message in the next recovery block. In a system where 
acknowledgement is supported, the acknowledgement can be 
issued as a(0,i,1) in the old recovery block i or delayed as 
a(0,j,1) in the new recovery block j. 

In contrast to our checkpointing method, the global 
checkpointing method disallows initiation and recognition of 
such messages as m(n+1,i,1) and m(0,j,2) whereas the 
unplanned method allows all messages to be initiated and 
recognized at any time. 

Processes that have already established the 
checkpoint and are involved in interprocess communication with 
processes yet to establish the checkpoint form a global group, and 
those processes that have established the checkpoint and/or 
are not involved in interprocess communication form an 
unplanned group. 

Synchronization among processes can be implemented 
implicitly by the interprocess communication protocol. No 
additional phases, such as those of the two-phase commit 
protocol [6], are needed to synchronize the processes. 
Messages can be sent with a sequence number specifying the 
recovery block in which they are generated. In a shared-memory 
system, the sequence number can be implemented by 
"checkpoint bits" in the address bus. 
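
The message handling rules of Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and method names (Process, receive, establish_checkpoint) are hypothetical, and the sketch assumes each message carries the recovery-block number of its sender as the sequence number mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"        # same recovery block: process immediately
    REJECT = "reject"        # sender is behind: it must reissue after its checkpoint
    TENTATIVE = "tentative"  # sender is ahead: hold until our own checkpoint


@dataclass
class Process:
    block: int = 0                                 # current recovery block index
    tentative: list = field(default_factory=list)  # held (block, payload) pairs

    def receive(self, msg_block: int, payload: str) -> Action:
        """Classify a message by the recovery-block number it carries."""
        if msg_block < self.block:
            return Action.REJECT
        if msg_block > self.block:
            self.tentative.append((msg_block, payload))
            return Action.TENTATIVE
        return Action.ACCEPT

    def establish_checkpoint(self) -> list:
        """Enter the next recovery block and commit messages that now match it."""
        self.block += 1
        committed = [m for m in self.tentative if m[0] == self.block]
        self.tentative = [m for m in self.tentative if m[0] > self.block]
        return committed
```

In this sketch, a request stamped with an older block (such as m(n+1,i,1) arriving at a process already in block j) is rejected, while a request stamped with a newer block (such as m(0,j,2)) is held as tentative and committed once the receiver establishes its own checkpoint.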


2.2. Adaptive Rollback Method 


Our rollback method is based on the criteria established by 
the checkpointing method. Rollback may seem deceptively 
simple, but care must be taken. Lack of synchronization 
during rollback results in a situation similar to the domino 
effect caused by lack of synchronization during 
checkpointing. The "livelock" described in [12] is one example. 


We will analyze cases in Fig. 3 in which rollback is 
performed between two processes. Extension to more than 
two processes can be achieved through rollback propagation 
and is not addressed here. 

Case 1. P2 fails at t1 or at t4 --- P2 rolls back to 
RP2(i-1) and P1 rolls back to either RP1(i-1) or RP1(i), 
depending on whether it has communicated with P2 in RB2(i-1), 
even though P1 is already in RB1(i). It is the logical recovery 
block RB2(i-1) that we refer to here, since it is the same as 
logical RB1(i-1). The physical boundaries of RB1(i-1) and 
RB2(i-1) are not the same. 
Case 2. P2 fails at t5 or while establishing RP2(i) --- 
The message m(0,i,2) is recorded in P1's message log and 
received by P2 in RB2(i-1). Yet m(0,i,2) will not be 
committed until P2 has established RP2(i). P2 will then roll 
back to RP2(i-1) and P1 to either RP1(i) or RP1(i-1). 
Case 3. P2 fails at t7 --- P2 rolls back to RP2(i) and P1 rolls 
back to RP1(i), since message m(0,i,2) is treated as a 
message after RP2(i) and recorded in both message logs in 
RB1(i) and RB2(i). 
Case 4. P1 fails at t2 or at t3 --- P1 rolls back to 
RP1() and P2 goes on as usual. 
Case 5. P1 fails at t6 --- P1 rolls back to RP1(i) and P2 goes 
on as usual. 
Case 6. P1 fails while establishing RP1(j) --- P1 rolls back to 
RP1(i-1) and P2 rolls back to RP2(i-1) if it has communicated 
with P1 in RB1(i-1). Otherwise P2 goes on as usual. 

2.3. Unplanned Recovery Methods 

Two unplanned recovery methods are briefly described here to 
highlight the differences between them and our method. 
We first describe the optimistic method. This 
method rolls back only the necessary number of steps, as 
determined by the interprocess communication pattern. It is 
concluded in [3] that only a few checkpoints are needed, 
depending on various system parameters. We present this 
method only as a performance index for the 
comparison of the other two methods. It 
is not necessarily a better performer than 
the adaptive method in all cases, although it betters the FTMR2M 
method in all cases. 

The FTMR2M method is another unplanned recovery 
method. It relies heavily on the assumption that the probability of 
single-step rollback overwhelmingly dominates the probabilities of 
multiple-step rollbacks. The system thus records only two 
checkpoints, so that it can roll back one step if a single-step rollback is 
determined to be sufficient. Otherwise, the system simply rolls back to the 
origin of the task, as if the failure were fatal. 

3. Shared-Memory System 

Based on the following assumptions, a mean value 
analysis is given to compare the performance of these 
rollback recovery methods. 
(1) The mean time to roll back to the last checkpoint is Tr. 
(2) The mean checkpointing time is Trb. 
(3) Interprocess communications are uniformly distributed. 
(4) An independent exponential failure distribution is assumed for 
all processing modules. 
(5) The probabilities of fatal and nonfatal failures are constants Pf 
and Pnf, respectively, with Pf + Pnf = 1. They are independent 
of the underlying Poisson failure distribution. 
(6) The probability of an i-step rollback is Prb(i), i = 1,2,...,M-1, and 
the sum of Prb(i) over all i is 1. These probabilities are also independent of the 
underlying Poisson failure distribution. 


3.1. Derivation of Mean Execution Time 
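(A) Optimistic Unplanned Method --- The mean task execution 
time is : 

$$
\begin{aligned}
T_A &= T_p M + \mu_A \Big\{ P_{nf} \Big[ T_r P_{rb}(1) + \sum_{i=2}^{M-1} \Big( T_r + T_p \frac{(i-1)(i-2)/2 + (M-i+1)(i-1)}{M} \Big) P_{rb}(i) \Big] + P_f \Big[ T_r + \frac{T_p (M-1)}{2} \Big] \Big\} \\
&= T_{ef} + \mu_A \Big\{ T_r + P_{nf} T_p \sum_{i=2}^{M-1} P_{rb}(i) \frac{(i-1)(i-2)/2 + (M-i+1)(i-1)}{M} + \frac{P_f (T_{ef} - T_p)}{2} \Big\} \qquad (1)
\end{aligned}
$$

μA is the mean number of failures during actual task execution 
time. Tef is Tp*M, the failure-free execution time including 
checkpointing overhead. Pf and Pnf are the fatal and nonfatal failure 
probabilities of assumption (5), and Prb(i) is the i-step rollback 
probability of assumption (6). 


(B) FTMR2M Method --- The mean task execution time is : 

$$T_B = T_{ef} + \mu_B \Big\{ T_r + \frac{T_{ef} - T_p}{2} \big[ 1 - P_{rb}(1) P_{nf} \big] \Big\} \qquad (2)$$


(C) Adaptive Method --- The mean task execution time is : 

$$
\begin{aligned}
T_C &= T_p M + T_{rb} \frac{T_{rb}}{T_p} \frac{T_C}{T_p} + \mu_C \Big\{ P_{nf} \Big[ T_r \sum_{i=1}^{M-1} P_{rb}(i) \Big] + P_f \Big[ T_r + \frac{T_p (M-1)}{2} \Big] \Big\} \\
&= \frac{T_{ef} + \mu_C \big[ T_r + P_f (T_{ef} - T_p)/2 \big]}{1 - (T_{rb}/T_p)^2} \qquad (3)
\end{aligned}
$$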
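
Eqns. (1) to (3) can be evaluated numerically with a short Python sketch. The function names and the truncated geometric distribution used for Prb(i) are illustrative assumptions and do not come from the paper; only the formulas themselves follow the equations above.

```python
def rollback_term(i: int, M: int) -> float:
    """Weight that multiplies P_rb(i) inside the sum of Eq. (1)."""
    return ((i - 1) * (i - 2) / 2 + (M - i + 1) * (i - 1)) / M


def t_optimistic(Tp, Tr, M, mu, Pf, Prb):
    """Eq. (1), second form. Prb maps i (1..M-1) to P_rb(i)."""
    Pnf, Tef = 1.0 - Pf, Tp * M
    s = sum(Prb[i] * rollback_term(i, M) for i in range(2, M))
    return Tef + mu * (Tr + Pnf * Tp * s + Pf * (Tef - Tp) / 2)


def t_ftmr2m(Tp, Tr, M, mu, Pf, Prb):
    """Eq. (2)."""
    Pnf, Tef = 1.0 - Pf, Tp * M
    return Tef + mu * (Tr + (Tef - Tp) / 2 * (1 - Prb[1] * Pnf))


def t_adaptive(Tp, Tr, Trb, M, mu, Pf):
    """Eq. (3), closed form."""
    Tef = Tp * M
    return (Tef + mu * (Tr + Pf * (Tef - Tp) / 2)) / (1 - (Trb / Tp) ** 2)


def geometric_prb(M: int, q: float) -> dict:
    """Hypothetical truncated geometric P_rb(i) on i = 1..M-1, summing to 1."""
    raw = {i: q ** (i - 1) for i in range(1, M)}
    total = sum(raw.values())
    return {i: v / total for i, v in raw.items()}
```

With identical parameters passed to all three functions, the results correspond to the nonoptimum comparison discussed in Section 3.2.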


3.2. Comparison of Mean Execution Time 


For simplicity of comparison, we assume that the mean 
number of failures during task execution time is μ' and 
remains the same for all three models. μ' stands for μA, μB and μC 
for the optimistic, FTMR2M and adaptive models, respectively. 


3.2.1. Comparison of FTMR2M and Adaptive 
Methods. Two performance comparisons will be studied, 
optimum and nonoptimum. The speedup of the 
FTMR2M method over the adaptive method is derived from 
Eqns. (2) and (3) : 

$$
\begin{aligned}
dT = T_B - T_C &= \big[ T_{ef}(B) - T_{ef}(C) F \big] + \mu' \Big[ \frac{(T_{ef}(B) - T_p(B)) (1 - P_{rb}(1) P_{nf})}{2} - \frac{(T_{ef}(B) - T_p(B)) P_f F}{2} \Big] \\
&\quad + \mu' \big[ T_r(B) - T_r(C) F \big] \qquad (4)
\end{aligned}
$$

where F is 1/(1-(Trb/Tp)**2). 


3.2.1.1 Nonoptimum Performance Comparison. We 
assume all parameters in Eqns. (2) and (3) are the same for 
both methods. The speedup can be approximated as : 

$$dT = T_{ef} \Big[ \mu \frac{1 - 1/M}{2} \big( 1 - P_{rb}(1) \big) - (T_{rb}/T_p)^2 \Big] \qquad (5)$$

where μ = μ'*Pnf, i.e., the number of nonfatal failures. 


Assume μ is 1 for simplicity. For the FTMR2M method 
to better the adaptive method, Prb(1) must be at least 0.98 if the 
checkpointing overhead (Trb/Tp) is 0.1. If the overhead is 
only 0.01, Prb(1) must be at least 0.9999. Fig. 4.b depicts the 
effect of μ and Trb/Tp on min{Prb(1)}. Fig. 4.a illustrates the 
difference in task execution time between these two methods. 
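
The break-even value of Prb(1) follows from setting dT in Eqn. (5) to zero and solving for Prb(1). The Python sketch below does this under the approximation of Eqn. (5). The function name min_prb1 and the choice of M are assumptions made for illustration.

```python
def min_prb1(mu: float, overhead: float, M: int) -> float:
    """Smallest P_rb(1) for which dT in Eq. (5) is not positive.

    overhead is the checkpointing overhead Trb/Tp, and mu is the mean
    number of nonfatal failures.
    """
    return 1.0 - 2.0 * overhead ** 2 / (mu * (1.0 - 1.0 / M))


# For mu = 1, overhead 0.1 and a large number of checkpoints M,
# the threshold is close to the 0.98 quoted in the text.
threshold = min_prb1(mu=1.0, overhead=0.1, M=1000)
```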


3.2.1.2 Optimum Performance Comparison. Some of the 
parameters mentioned above have to be adjusted so that the 
optimum can be realized. This optimization has been studied in 
[9]-[11]. Accordingly, Tp, Tr and M vary with the 
recovery method, since we assume Trb is the same for both 
systems. We use the triplet {Tp(i), Tr(i), M(i)} to represent the 
elements {Tp, Tr, M} for the different methods. 

To optimize system performance, the following equations 
must hold : 

$$T_p(C) - T_{rb} = \frac{(2 T_{rb} \cdot MTBF)^{1/2}}{P_{nf}} \qquad (6)$$

$$T_p(B) - T_{rb} = \frac{(2 T_{rb} \cdot MTBF)^{1/2}}{P_{nf} P_{rb}(1)} \qquad (7)$$


where MTBF is the mean time between failures. The above 
equations are derived in the same way as in [9]. In the FTMR2M 
method, every failure requiring more than one rollback step will 
restart the system from the origin of the task. Such failures are thus 
essentially the same as fatal failures. Pnf in Eqn. (6) should 
then be replaced by Pnf*Prb(1) to obtain Eqn. (7). 


Since (Tp-Trb)*M remains constant for both methods, 
we have M(B) ≤ M(C). The Tr's are: 

$$T_r(B) = MTBF - \frac{T_p(B) \exp(-T_p(B)/MTBF)}{1 - \exp(-T_p(B)/MTBF)} \qquad (8a)$$

$$T_r(C) = MTBF - \frac{T_p(C) \exp(-T_p(C)/MTBF)}{1 - \exp(-T_p(C)/MTBF)} \qquad (8b)$$

It is apparent that Tr(B) ≥ Tr(C). 
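
As a hedged illustration of Eqns. (6) to (8), the following Python sketch computes the optimum intercheckpointing interval and the resulting mean rollback time for both methods. The function names and the sample parameter values are assumptions chosen only to show the ordering of the results.

```python
import math


def optimal_tp(trb: float, mtbf: float, p_recover: float) -> float:
    """Eqs. (6)/(7): optimum Tp.

    p_recover is Pnf for the adaptive method and Pnf * Prb(1) for FTMR2M.
    """
    return trb + math.sqrt(2.0 * trb * mtbf) / p_recover


def mean_rollback_time(tp: float, mtbf: float) -> float:
    """Eqs. (8a)/(8b): mean time to roll back to the last checkpoint."""
    x = math.exp(-tp / mtbf)
    return mtbf - tp * x / (1.0 - x)


# Hypothetical parameter values, in arbitrary time units.
trb, mtbf, pnf, prb1 = 1.0, 1000.0, 0.9, 0.8
tp_c = optimal_tp(trb, mtbf, pnf)         # adaptive method, Eq. (6)
tp_b = optimal_tp(trb, mtbf, pnf * prb1)  # FTMR2M method, Eq. (7)
tr_c = mean_rollback_time(tp_c, mtbf)
tr_b = mean_rollback_time(tp_b, mtbf)
```

Because Prb(1) is at most 1, Tp(B) is never smaller than Tp(C) in this sketch, which is consistent with M(B) ≤ M(C) and Tr(B) ≥ Tr(C).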
Since Tp(B) varies as Prb(1) changes, we obtain the 
following : 

$$dT_B = dT_{ef} + \mu_B \Big\{ dT_r + \frac{dT_{ef} - dT_p}{2} \big( 1 - P_{rb}(1) P_{nf} \big) + \frac{T_{ef} - T_p}{2} P_{nf} \big[ -dP_{rb}(1) \big] \Big\} \qquad (9)$$

dTef < 0 and dTp > 0 improve, whereas dTr > 0 and 
[-dPrb(1)] > 0 degrade, the performance of the FTMR2M method. 
The last term in the above equation dominates the others, making 
the optimized performance of the FTMR2M method even 
worse than that of the adaptive method. As Prb(1) decreases, 
the probability of restarts becomes much higher even though Tef is 
slightly shorter. That is why the performance of the FTMR2M 
method is worse. 


3.2.2. Comparison of Optimistic Unplanned and 
Adaptive Methods. The execution time speedup is Tc - Ta, 
i.e., dT: 

$$dT = T_{ef} \Big[ \mu \sum_{i=2}^{M-1} \frac{(i-1)(i-2)/2 + (M-i+1)(i-1)}{M} P_{rb}(i) - \Big( 1 + \frac{P_f (T_{ef} - T_p)}{2} \Big) (T_{rb}/T_p)^2 \Big] \qquad (10)$$

We assume the distribution of Prb(i) is geometric. In the 
geometric distribution, Prb(1) dominates the other Prb(i)'s and 
Prb(i) ≥ Prb(j) if i < j, which makes the optimistic unplanned 
method a superb performer. The total rollback overhead during 
task execution time is depicted in Fig. 5 for the three recovery 
models. 

4. Message-Passing System 

In terms of rollback recovery overhead, the three rollback recovery 
methods behave no differently on 
shared-memory and message-passing systems when 
there is no error. The difference surfaces when rollback 
recovery is needed. Because of this difference, the adaptive method 
performs even better relative to the unplanned methods. 

The two unplanned methods require logging of 
interprocess communications, which is in nature the same as 
checkpointing synchronization in the adaptive method. 
Logging of interprocess communications requires that, upon a 
checkpointing signal, outstanding interprocess 
communications be finished before other phases of 
checkpointing can proceed. Otherwise the 
correctness of received messages cannot be guaranteed. For instance, if a process 
saves its internal state and executes the validation test while 
some interprocess communications are still outstanding, then even 
if this process passes its validation test, this implies only that the 
part of the interprocess communication that occurred before the end 
of the validation test is valid. There is no guarantee for the later 
part of the interprocess communication. Hence, from the point 
at which checkpointing is first recognized by one of the processes 
to the point at which checkpointing ends, all three methods 
behave identically. Only after this checkpointing period can we 
see the differences among the recovery methods. 


4.1. Derivation of Mean Execution Time 


The equations remain much the same as those of the 
shared-memory system, except for some minor modifications 
caused by transmitting interprocess communication logs to all 
processes. 

(A) Optimistic Unplanned Method --- 

$$T_A = T_{ef} + \mu_A \Big\{ T_r + P_{nf} T_{cl} + P_{nf} T_p \sum_{i=2}^{M-1} P_{rb}(i) \frac{(i-1)(i-2)/2 + (M-i+1)(i-1)}{M} + \frac{P_f (T_{ef} - T_p)}{2} \Big\} \qquad (11)$$

Tcl is the time taken to form the global interprocess 
communication log from the partial local logs recorded by each 
process. 


(B) FTMR2M Method --- 

$$T_B = T_{ef} + \mu_B \Big\{ T_r + P_{nf} T_{cl} + \frac{T_{ef} - T_p}{2} \big[ 1 - P_{rb}(1) P_{nf} \big] \Big\} \qquad (12)$$

(C) Adaptive Method --- 

$$T_C = \frac{T_{ef} + \mu_C \big[ T_r + P_f (T_{ef} - T_p)/2 \big]}{1 - (T_{rb}/T_p)^2} \qquad (13)$$

This is exactly the same as that of the shared-memory system, 
since constructing a global communication log is not required. 


5. Conclusion 


An adaptive rollback recovery system is proposed and 
compared with two other methods. The two essential components of 
this system, the adaptive checkpointing and rollback methods, are 
introduced and analyzed. The performance of the three rollback 
recovery methods has been analyzed and compared in terms 
of task execution time for both shared-memory and 
message-passing systems. The adaptive recovery method 
outperforms the other two methods whenever the single-step 
rollback probability is low, the checkpointing overhead is low and the 
number of failures is high. Besides the comparison of task 
execution time, we should also consider the fact that the 
adaptive recovery method needs less hardware, which 
implies fewer failures during task execution. We thus 
conclude that the above comparison is pessimistic: the 
performance advantage of the adaptive method is actually 
greater than what the comparison reveals. The 
adaptive rollback recovery method performs even better in a 
message-passing system. 

This adaptive method is best suited to 
shared-memory and message-passing systems with a 
centralized control mechanism. For sparsely distributed 
systems, an unplanned recovery method such as the one in 
[7] is prone to the domino effect, and it has yet to be assessed which 
of the three methods is more efficient. Further research is 
necessary to compare these three recovery methods 
on this type of system. 
ABSTRACT 


This paper presents a methodology to study multiple latent 
errors and near-coincident fault discovery in the memory of a 
shared memory multiprocessor. The delay between the generation 
of an error due to a fault and its detection (error latency) can cause 
multiple latent errors and near-coincident fault discovery in a 
system. The latter effect is widely known to be catastrophic to the 
continued operation of a system even in highly fault tolerant 
systems. The methodology is illustrated on the Alliant FX/8 under 
real concurrent workload conditions over a five-day period. This 
study finds that for a conservative error rate of one error per day, 
one out of four errors may manifest itself as a multiple latent error. 
At the same error rate, 8% of the error discoveries are near- 
coincident in nature for a time-window size of 50 microseconds 
(approximately 250 instruction cycles). A strong correlation 
between the existence of multiple latent errors and their near- 
coincident discovery is quantified. 


1. INTRODUCTION 


A prerequisite for designing high reliability systems is to 
understand the effect of faults and their manifestations. The behavior 
of faults in a computer system is not easy to comprehend. This is 
even more so in a multiprocessing environment, where the 
manner in which faults manifest themselves is usually complex. 
Analytical models suffer from constraining assumptions and 
developmental complexity. An alternative is to perform measurements and 
experiments on production multiprocessor systems. These aid the 
model building process and provide valuable insight for designing 
new systems. 


This paper studies the fault discovery process in a shared 
memory multiprocessor system. There is usually a delay between 
the generation of an error (caused by a fault) and its discovery by a 
detection mechanism. This time is commonly referred to as error 
latency. Long error latencies can potentially lead to accumulation 
of undiscovered errors (called latent or "lurking" errors) in the 
system. We define multiple latent errors as a condition within a 
system where two or more errors are yet undiscovered by the 
system. Latent errors can be major threats to the reliability of the 
system. This is because there exists a possibility that they can be 
discovered simultaneously, behaving as though multiple faults 
have occurred. Most recovery mechanisms however are not 
designed to handle multiple faults. There is also a possibility for 
multiple latent errors to be discovered close in time, thus stressing 
the error recovery mechanism. Such situations are referred to as 
near-coincident fault discovery and are known to be catastrophic 
in real systems [1,2]. 


The purpose of this experimental study is to quantify the 
characteristics of multiple latent errors and near-coincident fault 
discovery in a shared memory multiprocessor system under a real 
concurrent workload¹. A multiprocessing system presents a new 
dimension from the workload point of view, since a number of 
processes can be active at the same time. This casts a new 
perspective on the study of latent error behavior since the 
probability of error discovery is potentially higher. 


The experiment employs actual hardware measurements 
from an Alliant FX/8 system to simulate error occurrence in the 
system and to investigate multiple latent error occurrence and 
near-coincident fault discoveries. The Alliant FX/8 is a key 
component in the "Cedar" parallel supercomputer project at the 
Center for Supercomputing Research and Development in the 
University of Illinois at Urbana-Champaign [3]. The measured 
Alliant FX/8 runs the current version of "Xylem," the Cedar 
operating system. Specifically, the methodology is applied to the 
Alliant memory subsystem. The fault model used in this study 
assumes that a permanent error has occurred². The physical 
mechanisms causing the faults can vary and do not affect the 
results. 


The results are unique in that they provide new insight into 
the behavior of multiple latent errors and near-coincident fault 
discovery in a complex parallel processing environment. At a 
conservative error occurrence rate of approximately one error per 
day, there is a 25% chance that errors cause multiple latent errors 
in the system. Thus one out of four errors may manifest itself as a 
multiple latent error. Further it was found that 8% of the error 
discoveries are near-coincident in nature with a time window size 
of 50 microseconds. It was also found that the probability of 
multiple latent errors tends to saturate after a threshold error 
occurrence rate of approximately one error per day. The 
probability of near-coincident fault discovery was found to 
saturate also, but at a slower rate. 


1.1 Related Research 


There is little or no research cited in the literature which 
experimentally investigates the occurrence of multiple latent 
errors or near-coincident fault discovery. Fault injection studies in 
the FTMP (Fault Tolerant Multiprocessor) showed that the most 
likely threat to system failure in the short run was arrival of two 
failures so close to each other that system reconfiguration was not 
possible [1,2]. These experiments used pin-level fault injection 
while running specific programs. An analytical model for near- 
coincident faults in NMR systems with different voting schemes is 
presented in [4]. The general validity of such a model however is 
not established. 


Other related research consists of experiments conducted to 
measure fault/error latency. Experiments to measure fault latency 
via pin-level fault injections in FTMP are discussed in [5]. In this 
study, the researchers measured latency times for faults in 
different system components and obtained a standard distribution 
fit to their measured fault latency distribution. CPU fault latency 
for the digital microprocessor in FTMP was studied in [6,7] via 
gate-level simulation. A set of specific programs was used to 
exercise the CPU to reveal faults injected into the simulation. 


¹ When two or more processors are active the system is said to be in 
concurrent operation. 

² An error is that part of the system state which is liable to lead to failure. 
The cause - in its phenomenological sense - of an error is a fault. 


The above approaches and results are, however, not 
applicable in general to multiuser systems. More recently, latent 
fault behavior in the memory of a VAX 11/780 was studied in [8]. 
The memory system was instrumented for measurements, and 
fault/error latencies were calculated by simulated fault injection in 
the memory. The effect of workload on fault/error latencies 
was also investigated in [10]. 


Although the above studies investigate the subject of latency 
quite systematically, the question of multiple latent errors or 
near-coincident fault discovery is not addressed. As mentioned 
earlier, past measurements indicate that these problems are usually 
catastrophic to the system. 


The next section describes the experimental methodology 
used to calculate multiple latent errors and near-coincident fault 
discovery probabilities. Section 3 presents the results and 
discusses the multiple latent error and near-coincident fault 
discovery behavior seen. Section 4 summarizes the important 
results of the paper. 


2. EXPERIMENTAL METHODOLOGY 


Figure 1 shows the Alliant FX/8 components related to our 
study. Detailed information on the Alliant system is given in [9]. 
The system runs the current version of "Xylem," the Cedar 
operating system. Thus, from the software point of view, many 
features of the Cedar supercomputer are running on the Alliant 
FX/8. The workload on the Alliant FX/8 consisted mostly of 
scientific applications such as circuit simulation, weather 
modeling, digital animation and fluid dynamics. 


This study concentrates on the error characteristics within 
the main memory of a system. An important reason for this is that 
measured field results show the largest number of failures occur in 
the memory [10]. A large number of CPU errors can also be 
traced to the memory [11]. Further, since shared memory is a 
common resource, the possibility of it being the source of failures 
can be significant. 


[Figure 1. Configuration of Measured Alliant FX/8. Recoverable label: Computational Complex.] 


2.1 Hardware Measurements 


The Alliant FX/8 backplane was sampled to collect data on 
memory access operations from the shared cache. A Tektronix 
DAS 9200 with a 32K trace buffer was used for this purpose [12]. 
Hardware probes were attached primarily to the main memory 
address bus on the Alliant backplane. Other probes were used to 
monitor signals so that appropriate triggering could be performed. 


As mentioned in the Introduction, the measurements were 
performed while the system was executing a concurrent workload. 
The measurements were conducted over a five-day period, 8am to 
5:30pm daily, Monday to Thursday, and 8am to 3:45pm on Friday 
(primarily due to the drop in concurrent operations). Samples were 
taken approximately every 4 minutes³, with each sample 
containing 8K address references (representing 8K machine 
cycles). The total measurement period was approximately 46 
hours. 


Table 1 shows the filtered version of the raw data output. 
The addresses represent memory block start addresses. The 
memory is accessed in blocks of 32 bytes (the transfer size between 
the shared cache and memory). The fields cntl0 and cntl1 provide 
additional status information about the state of the memory bus. 


Table 1. Concurrent Workload Memory Address Trace 

Line no.   time stamp     address   cntl0   cntl1 
           00033316579    0D3FF8 
           0003331666B    000230 
           000333166BC    000232 
           0003331670D    000234 
           000333167BB    0D3FF7 
           000333167CC    1AE0F7 
           00033316869    0C2FF5 


2.2 Simulation 


The memory address trace obtained above was then used as 
input to a simulation system, which essentially reconstructed the 
address space into which simulated error injections were 
performed over the entire measurement period (the simulator was 
driven by the address trace). An error was discovered when the 
time of error injection at an address location was less than or equal 
to the time of arrival of that address in the concurrent workload 
address trace. The simulation environment consisted of three 
simulators, ELS (Error Latency Simulator), MLEI (Multiple 
Latent Error Identifier) and NCFI (Near-Coincident Fault 
discovery Identifier). Detailed information on the simulation 
environment is given in [13]. 
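
The discovery rule described above can be sketched in Python as follows. This is an illustration only and is not the ELS simulator of [13]: the function names, the (time, address) trace format and the sample values are assumptions.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(trace):
    """Map each block address to the sorted list of times it is accessed."""
    index = defaultdict(list)
    for time, address in trace:
        index[address].append(time)
    for times in index.values():
        times.sort()
    return index


def discovery_time(index, address, inject_time):
    """First access time >= inject_time for the block, or None if none exists."""
    times = index.get(address, [])
    pos = bisect_left(times, inject_time)
    return times[pos] if pos < len(times) else None


def error_latency(index, address, inject_time):
    """Delay between error injection and discovery, or None if undiscovered."""
    found = discovery_time(index, address, inject_time)
    return None if found is None else found - inject_time


# Hypothetical trace of (time, block address) pairs.
trace = [(0, 0x230), (5, 0x232), (9, 0x230), (12, 0x234)]
index = build_index(trace)
latency = error_latency(index, 0x230, 3)  # block 0x230 is next accessed at time 9
```

An error injected into a block that is never referenced again in the trace stays latent for the rest of the measurement period, which the sketch reports as None.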


For error injection purposes no distinction was made 
between specific locations within a block. Since the transfers from 
main memory occur in blocks of 32 bytes, an error in one location 
within the block is equivalent to an error in any other location in 
the block from a discovery point of view. This simplified the 
simulation somewhat and more importantly, smoothed out 
discontinuities arising out of the fact that the data were sampled. 


Simulated error injections were performed assuming an 
exponential distribution for error occurrence over the entire 
measurement period. The error injection rates (λ) were varied from 
0.009 to 0.058 (times 6 error injections per hour or "x6 ei/hr"). 
Address locations for the error injection were chosen randomly. 
The exponentially distributed intervals between error injections 
were also chosen randomly. 
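
A minimal Python sketch of such an injection schedule is given below. It assumes a rate expressed in error injections per hour and a list of candidate block addresses. The function name, the seed parameter and the sample addresses are illustrative and not taken from the paper.

```python
import random


def injection_schedule(rate_per_hour, period_hours, addresses, seed=None):
    """Return (time, address) pairs with exponentially distributed gaps.

    Injection times are in hours from the start of the measurement period,
    and each address is drawn uniformly at random from `addresses`.
    """
    rng = random.Random(seed)
    schedule, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_hour)  # exponential inter-injection gap
        if t >= period_hours:
            return schedule
        schedule.append((t, rng.choice(addresses)))


# Example: the highest rate in the text (0.058 x 6 ei/hr) over about 46 hours.
schedule = injection_schedule(0.058 * 6, 46.0, [0x230, 0x232, 0x234], seed=1)
```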


In order to obtain statistically consistent results, 
approximately 600 faults were injected at each error injection 
time. This is equivalent to the simulation being run 600 times for 
each error injection rate. In each run, a randomly chosen location 
is injected with an error. 


³ The sampling rate chosen reflects a compromise between an adequate 
sample size and delay in transferring data to a data logger. 


2.3 Measurement of Multiple Latent Errors 


Multiple latent errors occur when two or more errors are yet 
undiscovered in the system. In order to determine the probability 
of multiple latent errors at a given error injection rate, we first 
construct a latency profile for each injection. The latency profile 
for an injection is the profile of discovery times for all errors 
injected at that injection time. Once the time to discovery for each 
error injected is available, a latency profile can be plotted as in 
Figure 2. 


[Figure 2. Example of a Latency Profile. Labels: Error 1 through Error 8, Latency Profile, Error Latency Time, Error Injection.] 


Consider for simplicity a case in which two error injections 
are made in a measurement period. Figure 3 shows the three 
possible overlap scenarios for the latency profile. The multiple 
latent error regions between the two error injections are shown. 
The ratio of the multiple latent error region area to the total latency 
profile area of both injections gives a rough view of the probability 
of multiple latent errors in the system. The probability of multiple 
latent errors is defined as the ratio of the number of errors in the 
multiple latent error region to the total number of injected errors. 


[Figure 3(b). Type 1 Overlap] 

[Figure 3(c). Type 2 Overlap] 


Let $E_i$ represent the number of errors injected at error 
injection number i. Also let $Me_{i,j,\lambda}$ represent the number of errors 
of error injection i that exist as multiple latent errors with error 
injection j at the error occurrence rate λ (e.g., $Me_{1,2,\lambda}$ represents the 
number of errors in injection 1 that exist as multiple latent errors 
with injection 2 at the error occurrence rate λ, and $Me_{2,1,\lambda}$ is the 
number of errors in injection 2 that exist as multiple latent errors 
with injection 1 at the error occurrence rate λ). Then between two 
error injections n and m, where n < m, the probability of multiple 
latent errors $Mp_{n,m,\lambda}$ for an error occurrence rate of λ is 

$$Mp_{n,m,\lambda} = \frac{Me_{n,m,\lambda} + Me_{m,n,\lambda}}{E_n + E_m}$$


In the complex case of more than two error injections within
the measurement period, the multiple latent error probabilities can
be individually calculated with respect to one particular error
injection for the given error occurrence rate $\lambda$ (i.e., $Mp_{12,\lambda}$
is the multiple latent error probability between fault injections 1
and 2, $Mp_{13,\lambda}$ is the multiple latent error probability between 1
and 3, etc.). For each $Mp_{nm,\lambda}$, multiple latent errors may exist
in either of the two forms shown in Figures 3(b) and 3(c). But given
the definition of multiple latent errors, where at least two errors
must exist undiscovered in the system, only adjacent error injection
probabilities need be considered. Thus only the $Mp_{12,\lambda}$,
$Mp_{23,\lambda}$, $Mp_{34,\lambda}$, etc. values are used to give an overall
multiple latent error probability ($Mp_\lambda$) for the error
occurrence rate chosen. Thus, if $n_\lambda$ represents the number of
error injections achieved at the error occurrence rate $\lambda$, the
overall probability of multiple latent errors at error occurrence
rate $\lambda$ is

$$Mp_\lambda = \frac{\sum_{i=1}^{n_\lambda - 1} Mp_{i(i+1),\lambda}}{n_\lambda - 1}$$
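As a worked illustration of the two ratios defined in this section,
the following Python sketch computes a pairwise probability from
given counts and then combines adjacent-pair values. The function
names and the sample counts are hypothetical, and taking the overall
value as the mean over the adjacent pairs is an assumption of this
sketch about how the pairwise values are combined.

```python
def pairwise_mp(me_nm, me_mn, e_n, e_m):
    """Multiple latent error probability between injections n and m.

    me_nm: errors of injection n latent together with injection m
    me_mn: errors of injection m latent together with injection n
    e_n, e_m: number of errors injected at injections n and m
    """
    return (me_nm + me_mn) / (e_n + e_m)


def overall_mp(adjacent_mps):
    """Combine the adjacent-pair probabilities (Mp 1-2, Mp 2-3, ...).

    Assumption: the overall value is the mean over adjacent pairs.
    """
    return sum(adjacent_mps) / len(adjacent_mps)


# Hypothetical counts: 600 errors per injection, as in the experiment.
mp_12 = pairwise_mp(me_nm=300, me_mn=240, e_n=600, e_m=600)
print(mp_12)
print(overall_mp([0.5, 0.25, 0.75]))
```

Only the adjacent pairs are passed to overall_mp, following the
argument that non-adjacent pairs need not be considered.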


2.4 Measurement of Near-Coincident Fault Discovery 


In order to measure the probability of near-coincident fault
discoveries, we first choose an appropriate time window of size $T$.
Next we move this window over the total measurement time in
increments equal to $T$, each time observing the number of errors
discovered within the time window. The ratio of the total number of
errors found in that time window to the total number of errors
injected gives the probability of near-coincident faults in the
system. Note that, if errors from the same error injection are
discovered within the time window, they do not qualify as a
near-coincident fault discovery.


The total measurement period is divided into $n$ time slices
$t_1, \ldots, t_n$, each $T$ long except one (if true integer division
is not possible). The number of errors discovered (from different
error injections) in each $t_k$ is $N_{k,\lambda}$ at an error occurrence
rate $\lambda$, where $1 \le k \le n$. Again $n_\lambda$ represents the number
of error injections achieved at error occurrence rate $\lambda$. If the
total number of errors injected into the system is $E$, then the
probability of near-coincident faults ($NC_\lambda$) for an error
occurrence rate of $\lambda$ is

$$NC_\lambda = \frac{\sum_{k=1}^{n} N_{k,\lambda}}{E}, \qquad \text{where } E = \sum_{i=1}^{n_\lambda} E_i$$
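The windowing procedure can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not the analysis
code used in the study: the function name and the input format are
hypothetical, and a time slice is taken to contribute all of its
discoveries whenever they come from at least two different error
injections.

```python
from collections import defaultdict


def near_coincident_probability(discoveries, window, total_injected):
    """Estimate the near-coincident fault probability.

    discoveries: iterable of (discovery_time, injection_id) pairs
    window: time-window size T, in the same unit as discovery_time
    total_injected: total number of errors injected (E)
    """
    slices = defaultdict(list)
    for time, injection_id in discoveries:
        # Slice k covers the interval [k*T, (k+1)*T).
        slices[int(time // window)].append(injection_id)

    near_coincident = 0
    for ids in slices.values():
        # Errors from a single injection do not qualify.
        if len(set(ids)) >= 2:
            near_coincident += len(ids)
    return near_coincident / total_injected


# Hypothetical discovery log: (time, injection number).
log = [(1.0, 1), (3.0, 2), (12.0, 1), (14.0, 1), (25.0, 3)]
print(near_coincident_probability(log, window=10.0, total_injected=10))
```

In this example only the first slice holds discoveries from two
different injections, so two of the ten injected errors count as
near-coincident.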


3. RESULTS 


This section presents the experimental results on multiple 
latent errors and near-coincident fault discoveries for the Alliant 
memory subsystem. Recall that errors were injected at 
exponentially distributed intervals (with an error injection rate $\lambda$).
The memory address trace was then used for determining multiple 
latent errors and near-coincident fault discovery probabilities. For 
purposes of this study, errors were injected in the high usage 
regions of the memory. The region of injection represented 93% of 
the address references in the real concurrent workload trace but 
occupied only an eighth of the memory address space available. 
Clearly, the behavior of faults in this region is critical for 
continued system operation. 
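For readers who wish to reproduce the injection schedule, the
following Python sketch draws injection times with exponentially
distributed intervals. The function name, the seed, and the choice of
hours as the time unit are assumptions made for illustration; the
rate and period in the usage line are placeholders rather than the
experimental settings.

```python
import random


def injection_times(rate, period, seed=None):
    """Return injection times in [0, period) whose spacing is
    exponentially distributed with the given rate (events per unit time).
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # expovariate(rate) has mean 1/rate.
        t += rng.expovariate(rate)
        if t >= period:
            return times
        times.append(t)


# Hypothetical schedule: 0.05 injections per hour over 120 hours.
schedule = injection_times(rate=0.05, period=120.0, seed=1)
print(len(schedule), schedule[:3])
```

Because the intervals are random, the number of injections achieved in
a fixed period varies from run to run, which is why the text speaks of
the number of injections "achieved" at a given rate.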


On average, 14% of all errors injected remained
undetected during the measurement period (approximately 5
days). The error injection rate $\lambda$ for the experiment
was chosen to reflect realistic error occurrence rates (see [10]).
The range was chosen to be $0.009<\lambda<0.058$ (x6 error occurrences
per hour, approximately 2 to 16 error injections over 5 days).
The time-window sizes chosen for analysis in the near-coincident
fault discovery calculations represent reasonable error recovery
times for a high performance system. The time-window range was
varied from 1 microsecond to 250 microseconds (approximately 6
to 1500 instructions on the Alliant FX/8).


3.1 Multiple Latent Errors 


Figure 4 shows the variation in the probability of multiple
latent errors $Mp_{i(i+1),0.043}$ during the measurement period for an
error occurrence rate of approximately two errors per day (0.043
x6 ei/hr). Figure 5 shows the probability of multiple latent errors
being present in the system at different error occurrence rates. We
find that the probability of multiple latent errors increases from a
low of 0.04 at an error occurrence rate of approximately one error
every two days to a high of 0.50, which is more or less a saturation
probability. The oscillatory behavior of the graph is primarily due
to statistical variations.

Figure 4. An Example of Multiple Latent Error Presence


Figure 5 shows that, at a conservative error occurrence rate
of approximately one error per day ($\lambda=0.022$), there exists a 25%
chance ($Mp_\lambda \approx 0.25$) of multiple latent errors. This suggests
that one out of four errors has the potential of manifesting itself
as a multiple fault. On further examination of the plot in Figure 5,
we find that at low error injection rates the plot has a higher slope
than at high error injection rates. As expected, the error occurrence
rate (or the number of error injections) does have an impact on the
multiple latent error probability, but this effect subsides as the
error rate increases. The reason for this is that at higher error
occurrence rates seemingly more latent errors tend to be
discovered or "swept away", thereby resulting in a tapering effect
on the plot.

[Figure: "Multiple Error Behavior with respect to the Error
Occurrence Rate"]

Figure 5. Probability of Multiple Latent Errors


To show in detail how the multiple latent error probability
changes during the course of error injections, a plot of the variation
in probability of multiple latent errors for an error occurrence rate
of 0.031 (x6 ei/hour) is shown in Figure 6. There were nine error
injections in the measurement period for this error occurrence rate.
Each dotted line represents a multiple latent error probability plot
with respect to a specific error injection number. $L_1$ represents
multiple latent error probabilities for error injection 1 ($E_1$) with
error injections $E_2$, $E_3$, $E_4$ and $E_5$. The $Mp_{12,0.031}$,
$Mp_{13,0.031}$, $Mp_{14,0.031}$ and $Mp_{15,0.031}$ values are
represented on this line. Similarly $L_2$ represents multiple latent
error probabilities of error injection two ($Mp_{23,0.031}$,
$Mp_{24,0.031}$ and $Mp_{25,0.031}$) and so on. A downward behavior is
seen for all the lines. This seems intuitive; for $L_1$, say, the
errors of $E_1$ will tend to be discovered as time progresses, thereby
reducing the probability of multiple latent errors being present in
the system when a later injection $E_j$ is introduced.


To highlight one of the error discovery patterns in the
system, the high multiple latent error probability of 0.9 for $L_2$
will be explained. We find that most of the errors injected in error
injection 2 are discovered during the interval between error
injections 3 and 4. This is because $Mp_{23,0.031}=0.9$ implies that
both $Me_{23,0.031}$ and $Me_{32,0.031}$ have high values. The multiple
latent error probability plot of error injection 2 continues till
error injection 5. This shows that all errors discovered during the
measurement time period are discovered before error injection 6. Thus
the remaining errors injected at error injection 2 are discovered
between error injections 2 and 3, error injections 4 and 5, and error
injections 5 and 6.

[Figure: "Multiple Error Behavior with Successive Injections";
vertical axis: probability of multiple latent errors]

Figure 6. Probability of Multiple Latent Errors at Diff. Injections


3.2 Near-Coincident Fault Discovery 


Figure 7 shows the variation of probability of near-coincident
fault discovery with time-window sizes from 10 to 250
microseconds for three different error rates. As expected, the
near-coincident fault probability increases monotonically with
time-window size. However, the rate of increase in probability
of near-coincident faults slowly decreases for larger time-window
sizes. Figure 8 shows a microscopic view (1 to 10 microseconds)
of the behavioral change in the near-coincident fault probabilities.
The step function behavior is easily understood by the fact that if
we have near-coincident faults in time-window size $T$, then those
same near-coincident faults must exist in time-window size $T+1$.


The variation of probability of near-coincident faults for
three time-window sizes (10 µs, 100 µs and 200 µs) is shown in
Figure 9. The range of error rates used is $0.009 \le \lambda \le 0.049$
(x6 error injections per hour), approximately 2 to 14 error
injections over the measurement period. From Figure 9, the
near-coincident fault probability values range from 0.003 to
approximately 0.21, over the 10 to 250 microsecond time-window size.

[Figure: "Near-Coincident Fault Discovery Variation With Time-Window
Size (Microscopic)"]

Figure 8. Microscopic Time Window Size Variation of Probability

[Figure: "Near-Coincident Fault Discovery Behavior with respect to the
Error Occurrence Rate"]

Figure 9. Probability of Near-Coincident Fault Discovery

In comparing Figure 5 and Figure 9, we can see that there exists
a high correlation between the existence of multiple latent errors
and their discoveries in near-coincidence. In Figure 9, the plot
starts to taper after an initial steep rise, as in the multiple latent
error probability case. The saturation effect comes about more
slowly in Figure 9, though, becoming apparent at higher error
occurrence rates than in Figure 5. The reason for this is that, as the
rate at which latent errors are "swept out" increases, the
probability of near-coincident fault discovery also increases as a
side effect. However, after a certain error occurrence rate, the rate
of removal of latent errors from the system has a more pronounced
effect on the probability of near-coincident fault discovery. Thus
the probability plot saturates more slowly as a result.


4. CONCLUSIONS 


This paper has described a methodology to study the 
behavior of multiple latent errors and near-coincident fault 
discovery in the memory subsystem of a shared memory 
multiprocessor. Past studies have shown that these effects can 
seriously degrade the reliability of a system. The methodology 
was illustrated on a production multiprocessor system, the Alliant 
FX/8, running the operating system environment of the "Cedar"
supercomputer.


The results show that even with a conservative error 
occurrence rate of one error per day, there is a 25% chance that 
errors result in multiple latent errors. It was also found that 8% of 
the error discoveries are near-coincident in nature with a time 
window size of 50 microseconds. It was also seen that the 
probability of multiple latent errors tends to saturate after a 
threshold error occurrence rate of approximately one error per day. 
The near-coincident fault discovery probability increases
monotonically with larger time-window sizes. A strong
correlation was found between the existence of multiple latent
errors and their near-coincident discovery. The saturation effect
on the probability of near-coincident fault discovery was seen to take
effect more slowly than that for the probability of multiple latent
errors.


Future work is expected to involve investigation of methods 
to use such experimental results to make reliability and 
availability predictions for measured systems. It is suggested that 
other parallel systems be similarly studied so that more
information on error characterization is available. 
Abstract


The PEPSys project is concerned with the definition and 
evaluation of a parallel logic programming system 
addressing a complete spectrum of issues, from high-level 
language and applications to implementation on machine 
architectures. This paper discusses the design issues and 
trade-offs involved in the specification of a particular 
distributed architecture to support the sequential, OR, and
Independent AND-parallel mechanisms of PEPSys. Of 
particular interest are the load balancing strategies adopted 
and their evaluation by simulation. The simulation of the 
architecture has also produced many other results which are 
presented along with a discussion of their implications. 


1. Introduction 


The general goal of the PEPSys (Parallel ECRC Prolog
System) project, which started in mid-1984, is to study and
evaluate new and practicable solutions to the problems of
parallel logic programming, which was found to be a useful
vehicle for expressing parallelism [6]. The PEPSys
language [4] was designed to exploit the OR and
Independent-AND parallelism inherent in declarative logic
programming languages. Together with the language, a 
superset of conventional PROLOG, a new computational 
model [7] and an abstract machine were designed. Besides 
the authors, all members of the PEPSys team have 
contributed to this work: H. Westphal, D. Peterson, 
J. Chassin, JC Syre. 

Of particular interest is the class of architectures most
amenable to efficient execution of PEPSys. A class of
architectures which supports the basic characteristics of the
language, its computational model and abstract machine has
been identified. This paper describes the architecture, its
features and the underlying design philosophy, and presents
some preliminary performance results obtained from
simulation of the architecture. The problems of load-balancing
in the machine are discussed. A more detailed
description can be found in [1].

The PEPSys computational model provides efficient 
solutions to problems central in any parallel Prolog 
implementation: the management of variable bindings in a 
parallel environment and the control of the search space to 
produce all (wanted) solutions. Its main features, detailed in 
[7], are retro-active parallelisation at very little cost, 
shallow binding with an explicit time-stamping mechanism 
and full combination of OR and AND parallelism with 
sequential backtracking execution. The implications of such 
a model on this architecture are discussed in the next 
section. 


2. The PEPSys Architecture 


The overall goals of PEPSys have greatly influenced the 
specifications of this architecture. In this section, an 
overview of the major decisions made in the design of the 
architecture is presented including justification thereof. 


[Figure: four clusters connected to a communication network]

Figure 1: PEPSys Multi-Cluster Architecture.


Requirements of an Architecture 

The architecture must deliver a scalable performance as the 
computing power of the machine is increased. Therefore it 
was critical to restrict parallelism to where it is really useful 
and to limit the increased communication overhead. A 
further requirement was that the architecture be ‘open-ended’
to allow for future modifications and extensions to
be incorporated with relative ease. Finally, the architecture
should be flexible enough to allow the machine to assume
different roles, e.g. as a backend symbolic processor or as a
front-end dedicated Prolog machine. To match the coarse-grained
parallelism of PEPSys with communication costs, a
cluster-based design was chosen. Figure 1 depicts the
abstract view of the cluster architecture.


2.1. Architectural Specifications 


Making use of the experience gained in the implementation 
of PEPSys on a shared-memory Siemens MX-500 machine 
[3], the number of PEs has been scaled up by adding more 
clusters, thereby introducing the notion of distance between
PEs; when PEi of cluster j wishes to access a variable on
PEk on cluster l, it induces two levels of communication:
intra-cluster and inter-cluster. 

The PEPSys computational model solves the problem of
maintaining multiple bindings of variables in an OR-parallel
environment through the use of hash windows. Combining
(and nesting) AND-parallelism with OR-parallelism is done
using an additional data structure, join-cells, together with
hash-windows. From the architectural point of view, such a
model does not impose the choice of memory used in the
architecture (shared or unshared), as a process is an
independent entity, identified by a process number, a
hash-window and a root-frame [5], and accesses common
variables by searching ancestor hash-windows. Thus the
implementation of PEPSys using shared or private memories
for each processing element is possible.
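The idea of resolving a variable by searching ancestor hash-windows
can be pictured with a small Python sketch. It is purely
illustrative: the names HashWindow, bind and lookup are assumptions,
and the actual PEPSys structures (time stamps, join-cells and
root-frames, described in [5] and [7]) are omitted.

```python
class HashWindow:
    """Per-process table of variable bindings with a link to the
    hash-window of the parent process (hypothetical structure)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.bindings = {}

    def bind(self, var, value):
        # A binding is recorded only in the process's own window, so
        # sibling OR-branches never see each other's bindings.
        self.bindings[var] = value

    def lookup(self, var):
        # Search this window first, then the chain of ancestors.
        window = self
        while window is not None:
            if var in window.bindings:
                return window.bindings[var]
            window = window.parent
        return None  # unbound along this branch


root = HashWindow()
root.bind("X", 1)
left, right = HashWindow(root), HashWindow(root)
left.bind("Y", "a")
right.bind("Y", "b")
print(left.lookup("Y"), right.lookup("Y"), left.lookup("X"))
```

Since each process only needs its own window and a way to reach its
ancestors, the same scheme works whether the windows live in shared
memory or are reached through messages.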


2.2. The Structure of a Cluster 


A cluster is a small number of identical processing 
elements sharing a common memory and communicating 
with the shared memory through a high-speed bus (Figure 
2). Several successful attempts have been made to
implement OR-parallel computational models on limited-resource,
shared-memory machines [2], and therefore it was
natural to extend this approach by connecting several such
clusters to a communication network. A program’s search
space could then be divided between several clusters in 
order to improve performance and to be able to run much 
larger programs. 


[Figure: a cluster with its shared memory and Network Interface,
connected to the Communication Network]

Figure 2: A cluster connected to a communication
network.


It is clear that while increasing the processing power of the 
machine, its complexity has increased in the form of the 
extra level of communication between clusters. To alleviate 
this undesired consequence the problem was attacked on 
two levels: adding additional hardware and using
sophisticated methods to reduce such communication. Each 
cluster is augmented with a Cluster Processor (CP) whose 
primary function is to handle inter-cluster communication. 
Other CP functions include: servicing remote dereferencing 
requests, managing local load-balancing (acquiring remote 
work when necessary), servicing PE requests, aborting 
processes and local bookkeeping. 

The Inter-cluster Communication Network 

The CPs communicate with each other over a common bus 
via message passing. The decision to use message passing 
between clusters was based on the following factors: 


• message-passing is far more flexible and
enables the implementation of different
inter-cluster communication networks.

• message-passing lends itself naturally to a
distributed architecture in which individual
components execute asynchronously.

• messages are of a higher level of abstraction: it
is possible to implement a global address space
with messages.

At implementation level the distribution of work amongst
clusters is restricted to avoid a communication bottleneck.
This was corroborated by simulation results for fairly large
programs, which show an average bus utilisation of between
12% and 30% of the total runtime, using a
communication-inducing configuration: 1 PE per cluster for 10 clusters.


2.3. Cluster Processor (CP) 


The Cluster Processor is viewed as the cluster’s ‘workhorse’:
a powerful processing unit performing a host of
tasks, whose ultimate aim is to satisfy the local PEs’
demands for work, values etc., while doing additional work
in its idle time to speed up overall performance.


[Block diagram; labels recoverable from the figure: intra-cluster
bus, CPU, bus interface, memory, message buffering unit (IN-buffer,
OUT-buffer, LOCAL-buffer), inter-cluster network interface, and
Dereferencing Unit (dereferencing cache, dereferencing buffer,
variable dereferencing unit, hash window unit)]

Figure 3: A Cluster Processor Block Diagram.


The block diagram of a cluster processor (CP) is shown in 
Figure 3. The CP has four message queues managed by 
hardware as FIFOs. The size of these buffers is small as 
only a few messages are expected to be in a queue at any 
given time. Buffer overflow is handled by directing excess 
messages to the CP’s private memory. The CP has a small 
private memory used for cluster bookkeeping, temporary 
scratchpad and overflow areas. It has a fairly large set of 
dedicated registers in addition to a set of general purpose 
registers. The dedicated registers are used for fast access to 
CP data tables and counters. The basic execution cycle of a 
CP is sketched below: 


loop:
{   process a message from each buffer
    if all buffers are empty then
        do_idle_time_work
}
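The execution cycle can be made concrete with the following Python
sketch. It is only an illustration of the loop as sketched in the
text: the buffer names, the handler callbacks, and the use of software
deques in place of hardware FIFOs are all assumptions.

```python
from collections import deque


def cp_cycle(buffers, handle, do_idle_time_work):
    """Run one iteration of the CP loop.

    buffers: dict mapping a buffer name to a deque of messages
    handle: callback invoked for each message taken from a buffer
    do_idle_time_work: callback invoked when every buffer is empty
    Returns True if at least one message was processed.
    """
    processed = False
    for queue in buffers.values():
        if queue:
            handle(queue.popleft())  # FIFO order within each buffer
            processed = True
    if not processed:
        do_idle_time_work()
    return processed


# Hypothetical buffer names and messages.
buffers = {"IN": deque(["m1", "m2"]), "OUT": deque(),
           "LOCAL": deque(["m3"]), "DEREF": deque()}
handled, idle = [], []
while cp_cycle(buffers, handled.append, lambda: idle.append("idle")):
    pass
print(handled, idle)
```

Taking at most one message from each buffer per iteration keeps any
single queue from starving the others, which is one plausible reading
of "process a message from each buffer".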


3. Load Balancing Strategies 


Work is distributed between processing elements at two 
levels: intra-cluster and inter-cluster. The intra-cluster
work-distribution strategy is an extension of [5] and is not
discussed here; in this section the load balancing scheme 
implemented is described along with other possible schemes 
suitable for the architecture. 

A workpool containing potential pieces of work for OR- 
branches and AND-branches is maintained in each cluster. 


As these workpools are global and can be accessed by any
PE in the cluster and of course by the CP, mutual exclusion
must be ensured when modifying the workpool.

A PE executes a local-search-for-work procedure when it
becomes idle, attempting to find work in one of the cluster’s
local workpools. If the PE finds work, it modifies the
workpool and executes the work. On the other hand, if no
work is available, the PE sends a message to the local CP
asking for remote work. In other words, a lazy scheme for
load-balancing, based on demand-driven activation by idle
PEs, is used. A backtracking PE can either wait for its
children to terminate, or it can speed up their termination by
taking work from their descendants (constrained remote
work). The latter scheme was implemented and it was
found that this operation must be severely restricted to
prevent delocalizing computation (by the spreading of
sub-trees that are too small) and thrashing of goals between
clusters. Figure 4 depicts a simple example:


goal: h, p.

Figure 4: A Load Balancing Example.


p and q are or-parallel predicates and in the example their
clauses are numbered p1, q1, p2, q2, ... for clarity. Assume the
first clause of p is executed on cluster1 by PE1 and the
second clause is executed by PE2 on cluster2. PE1 executes
h, followed by p1 then r then s (Figure 4). Execution of s
fails, resulting in PE1 backtracking to p1; now PE1 cannot
backtrack beyond this point until p2 has terminated on
cluster2. At the same time p2 has spawned q, another
parallel predicate which in turn may have alternative clauses
‘stolen’ by other clusters. PE1’s decision to ask for work
from p2 or rather to wait for p2’s termination is extremely
difficult to make, depending heavily on a particular
program’s behaviour. This kind of inter-cluster work
distribution can be restricted by ‘capturing’ entire sub-trees
in one cluster, or by refusing such a work request on the
remote cluster.

Only half the picture has been explained. The CPs play a
crucial role in distributing the load over the entire machine.
The CP maintains a queue of idle PEs which have requested
‘remote’ work. Periodically, this queue is inspected by the
CP and, when local conditions are met, the CP selects a
remote cluster and sends it a ‘request_for_work’ message.
The CP manages a list of remote CPs to query for work,
based on partial information it has obtained while polling
these CPs and from messages received from them.
Alternative approaches would be to remove polling of CPs
altogether and provide some random choice function instead,
or to monitor the bus.

A CP receiving a request for work initiates the
search_for_work procedure. If the cluster is too busy or
alternatively too idle, it can refuse the request immediately.
When work is found it is sent to the requesting cluster and
the local workpool structures are updated.

This load-balancing strategy places the burden of finding
work locally on the PE, which is idle anyway. The CPs
control the number of external work requests outstanding
per cluster; even so, requests for work must be kept to a
minimum through the use of pragmas (at program level) and
run-time information. On the receiving side, finding work
for remote PEs is done by the CP without interrupting local
PEs.
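The demand-driven scheme described in this section can be summarised
in a short Python sketch. It is a simplification under stated
assumptions rather than the simulated implementation: message passing
between PEs and CPs is collapsed into direct method calls, and the
class, the thresholds spare_threshold and backlog_limit, and the
cp_backlog counter are hypothetical stand-ins for "too idle" and
"too busy".

```python
import threading
from collections import deque


class Cluster:
    def __init__(self, name, work=(), spare_threshold=1, backlog_limit=4):
        self.name = name
        self.workpool = deque(work)
        self.lock = threading.Lock()  # mutual exclusion on the workpool
        self.spare_threshold = spare_threshold
        self.backlog_limit = backlog_limit
        self.cp_backlog = 0           # hypothetical measure of CP load

    def local_search_for_work(self):
        """Executed by an idle PE of this cluster."""
        with self.lock:
            return self.workpool.popleft() if self.workpool else None

    def search_for_work(self):
        """Executed by the CP on receiving a request_for_work."""
        with self.lock:
            too_busy = self.cp_backlog > self.backlog_limit
            too_idle = len(self.workpool) <= self.spare_threshold
            if too_busy or too_idle:
                return None           # refuse the request immediately
            return self.workpool.pop()


def find_work(local, remotes):
    """Lazy load balancing: try the local workpool first, then ask
    remote clusters in the order given."""
    work = local.local_search_for_work()
    if work is not None:
        return work, local.name
    for remote in remotes:
        work = remote.search_for_work()
        if work is not None:
            return work, remote.name
    return None, None


a = Cluster("a")
b = Cluster("b", work=["g1", "g2", "g3"])
c = Cluster("c", work=["h1"])
print(find_work(a, [c, b]))
```

In the usage lines, cluster c refuses because it has no work to spare,
so the idle PE of cluster a obtains a goal from cluster b.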
Figure 5: 8 Queens problem with work distribution
optimisation.


By reserving branches high up in the search tree for remote
clusters, additional performance gains averaging 20 per cent
have been achieved. Figure 5 shows the performance for
various configurations of the architecture running the 8
queens (all solutions) program with this improved initial
work distribution. The speedups delineated in the graphs
below are obtained by comparing against the performance
of a uni-processor on the same benchmarks.


4. Performance Analysis 


In this section some preliminary performance results
obtained from architectural simulation are presented. A
multitude of configurations with the number of clusters and
PEs ranging from one to ten were simulated. For a
single cluster architecture up to 30 PEs were simulated. No
optimizations of any kind were included in these
architectures, i.e. the CP was an ordinary processor running
at the same speed as a PE, it had no parallel sub-units and
management of its message buffers was done by software. A
simple load-balancing scheme was employed: no restrictions
were imposed on suspended PEs’ requests for work from
their (remote) children and a threshold equal to half the
number of PEs in a cluster was used to control external
requests for unconstrained work. In the implementation of a
cluster none of the important optimizations to PE
dereferencing or local work management have been made.
The graph in Fig. 6 shows the speedup obtained for the 8
queens program, run on a particular set of configurations: 4
PEs per cluster, 6 PEs per cluster, 8 PEs per cluster and 10
PEs per cluster, for 1 to 10 clusters. The three factors
mentioned below account for the less-than-ideal speedups
achieved:


• simple load-balancing:

    - the scheme should severely restrict the
      taking of remote work

    - the initial ‘spreading’ of work has to be
      improved

• no optimizations were performed: the CP must
execute faster than its local PEs and should
have asynchronous, hardware sub-units.

• the test-programs must be large enough with
respect to the amount of sustainable parallelism
they exhibit.

Detailed statistics were gathered from two representative
configurations: 100 PEs on 10 clusters, and 10 PEs on 10
clusters. As expected, the inter-cluster bus does not cause a
communication bottleneck. This is due largely to the
process-oriented nature of PEPSys’ computational model
which was discussed previously.
Figure 6: Overall performance of 8 Queens for different
cluster configurations.


The distribution of communication (in the form of message 
passing) between any two clusters in a configuration is 
almost uniform, excluding the cluster initiating the 
computation which always has a heavier load. By dividing 
the work to be done at the highest possible level in the 
search tree, this additional overhead can be eliminated on 
the initiating cluster. 


5. Conclusion 


The architecture presented above fulfills the initial
requirements:

• an increase in processing power is viable
through parallelism, and its implementation in
hardware is feasible with state-of-the-art
technology.

• flexibility: communication is limited and
evenly distributed between clusters, making
replacement of the communication network
easy once the target environment of the
machine is known. Adding complexity to the
CP can be done in a straightforward and
efficient manner.


Partial answers to questions posed by the computational 
model such as the frequency of dereferencing, the 
availability of long sequential branches in PEPSys programs 
and hash-window chain lengths have been obtained. Many 
important questions regarding the implementation of 
PEPSys were investigated, in particular work-distribution
strategies, which were found to influence performance
immensely. The concept of distance between processing
elements was introduced, allowing greater processing power
while not vitiating performance too severely. More ‘real’
programs need to be measured to provide empirical
validation of the design and performance must be boosted 
significantly. The existence of a large class of applications, 
generating sufficient amounts of parallelism to sustain the 
machine, must be ascertained. 


References 


[1] U. Baron, B. Ing, M. Ratcliffe, P. Robert. 
A Distributed Architecture for the PEPSys Parallel 
Logic Programming System. 
Technical Report 25, ECRC, Nov, 1987. 


[2] R. Butler, E.L. Lusk, R. Olson, R.A. Overbeek. 
ANLWAM - A Parallel Implementation of the 
Warren Abstract Machine. 
Internal Report, Argonne National Laboratory, 1986. 


[3] J. Chassin, H. Westphal, D. Peterson.
The Implementation of PEPSys on a MX-500
MultiProcessor.
Internal Report, ECRC, December, 1987.


[4] M. Ratcliffe, J.C. Syre. 
The PEPSys Parallel Logic Programming Language. 
In IJCAI. ECRC, August, 1987.


[5] Philippe Robert. 
An Emulator for the PEPSys Abstract Machine. 
Internal Report PEPSys 17, ECRC, April, 1987.


[6] Akikazu Takeuchi and Koichi Furukawa. 
Parallel Logic Programming Languages. 
In Third International Conference on Logic 
Programming, pages 242-254. July, 1986. 


[7] H. Westphal, P. Robert, J. Chassin, J.-C. Syre. 
The PEPSys Model: Combining Backtracking, 
AND- and OR-parallelism.
In Proceedings - 1987 Symposium on Logic 
Programming, pages 436-448. IEEE, 
September, 1987. 


A DATAFLOW ARCHITECTURE FOR OR-PARALLEL EXECUTION OF LOGIC PROGRAMS 


A. V. S. Sastry and L. M. Patnaik 
Department of Computer Science and Automation 
Indian Institute of Science 
Bangalore 560012 


ABSTRACT Logic programming languages have gained
wide acceptance for two reasons. First
is their clear declarative semantics and the
second is the wide scope for parallelism they
provide, which can be exploited in parallel
implementation. In this paper, a dataflow
architecture (based on the Manchester ring) to
support OR-Parallelism and Argument Parallelism
is proposed. A new scheme for handling the deferred
read mechanism using the matching unit of the
machine is suggested. The required data
structures and the built-in dataflow procedures
for OR-parallel execution are discussed.
Multiple binding environments are handled by a
modified form of the directory tree method that is
suitable for dataflow implementation. This
method is illustrated by an example. The
dataflow graphs of the program clauses are calls
to the built-in procedures; therefore they are
modular and independent of argument complexity.
This feature makes the compilation of the
clauses very easy.


1. INTRODUCTION 


Logic programming is a novel programming 
style with clear declarative semantics. It 
means that the user program is more like a 
specification of the problem than the 
specification of the algorithm - as is the case 
with conventional von Neumann languages. 
Therefore writing programs in this paradigm is 
very simple and elegant. Comparing logic 
languages with functional languages, we find 
that a logic program can be thought of as an 
equivalent of a set of functional programs, one 
functional program corresponding to one instance 
of the input-output mode of the arguments of the 
clause in the logic program. Hence logic
programs are more compact than the equivalent
functional programs.

One important application of logic 
programming is in its use in knowledge 
representation and reasoning. Both declarative 
and procedural knowledge can be represented 
quite succinctly in the form of clauses. Facts 
and rules can be represented with the same ease. 
Reasoning with knowledge can be done using the 
inference rule of first order logic called 
resolution. This ability makes this language 
paradigm quite suitable for artificial 
intelligence applications. 

A promising feature of logic languages from
the implementation point of view is that they do
not obscure any parallelism present in the
program. Parallel architectures based on control
flow, dataflow and reduction [3,10..15] have been
proposed. The motivation for parallel
architectures is to speed up symbolic
computation, where problems are quite
unstructured and weak methods of problem solving
are applied.

In this paper, we propose an extension of the
Manchester dataflow machine that can support
OR-parallelism of logic programs. In addition,
our machine supports argument parallelism.
A new scheme for handling the deferred read
mechanism, which permits a read request at an
empty memory location, using the matching unit
of the machine is discussed. This simplifies the
design of the memory modules. We propose suitable
data structures and explain how OR-parallel
execution is possible on our machine. The rest of
the paper is organized as follows. Section 2
contains a brief description of the definitions
and the basic computational model of logic
programs. Section 3 gives a description of our
architecture and discusses the deferred read
mechanism. Section 4 deals with the data
structures, dataflow procedures and the scheme
for handling the binding environments. Section 5
contains the preliminary simulation results of
execution of a simple logic program on this
machine.


2. PRELIMINARIES 
2.1 Basic Definitions


A logic program is a set of clauses
expressed in first order logic. We restrict
ourselves to a specific subset of first order
logic called Horn clause logic. The syntax of a
clause is given below:

A :- B1,B2,...,Bn
where A,B1,B2,...,Bn are called predicates, which
are relations over the given domain. A is called
the head of the clause and B1,B2,...,Bn together
constitute the body of the clause. Each
predicate has a fixed arity. A predicate is
represented as

P(t1,t2,t3,...,tn)
where P is the predicate name and t1,t2,...,tn
are the arguments, which are terms of the
first order logic language. An example of a
predicate is Father(john,mary), which asserts the
relationship between the two terms john and mary
of the domain.


A term is recursively defined as follows:

1. A variable is a term, e.g. x, y and z.

2. A functor f(t1,t2,...,tn) is a term, where
t1,t2,...,tn are terms and f is the functor
name. The arity of the functor is n. The
constants of the domain are the functors of
arity 0.


There are three kinds of clauses. A unit clause
does not have a body. A definite clause has both
a head and a body. A goal clause has a body but
no head.


2.2 Interpretation of Logic Programs 


The declarative meaning of a clause
A :- B1,B2,...,Bn is that the predicate A is true
if B1,B2,...,Bn are simultaneously true. A unit
clause is unconditionally true as it does not
have a body. Unit clauses form the facts of the
program. Definite clauses are rules and goal
clauses are the intended queries made on the
logic program. In logic parlance, the set of
clauses forms the set of axioms and a goal clause
is a theorem to be proved. Executing a logic
program means finding the instances of the goal
clause implied by the given set of clauses.

From the point of view of execution,
Kowalski has given a procedural interpretation
of logic programs. In this view, each clause can
be considered as a procedure. The body of the
clause is a set of procedure calls. A goal clause
can be considered as the initial set of calls to
the various procedures in the program. The
passing of parameters from the goal to the body
is done by a bidirectional syntactic pattern
matching procedure called unification.


2.3 Basic Computational Model 


The underlying model of computation is
unification. Comparing it with reduction, which
is the computational model of functional
languages, we find that in unification
there is no commitment of the variables as input
or output. Specifying any subset of the
variables of the goal clause and executing
the program yields a solution that
gives the values of the unspecified
variables. In reduction, a set of variables is
designated as input variables. Input variables
have to be specified to get the output, implying
that reduction is unidirectional. The
unification of two predicates results in a
minimal set of variable bindings called the
Binding Environment (BE). If the two predicates
are unifiable, the BE created is unique and is
known as the most general unifier of the two
predicates. If the two predicates are not
unifiable, the result of unification is a
'fail' message.

In order to solve a goal in a given logic 
program, an inference rule called resolution is 
applied. The basic algorithm for solving the 
goal clause of a logic program is outlined 
below: 

Initialize goalset to the given goal
While goalset not empty do
Begin
  step1 : select a goal from goalset
  step2 : find the matching clause/clauses
  step3 : unify the head of the clause and the
          goal to generate the BE
  step4 : pass the bindings to the body of
          the selected clause and include the
          body in the goalset
End
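
The loop above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the machine's implementation: the term encoding (tuples for predicates, strings starting with "?" for variables), the helper names unify, rename, resolve and solve, and the worklist are all hypothetical. Where the proposed machine would attempt all matching clauses simultaneously, the sketch pushes one (goalset, BE) pair per matching clause onto a worklist and explores them one after another.

```python
from itertools import count


def is_var(t):
    # Assumption of this sketch: variables are strings starting with "?".
    return isinstance(t, str) and t.startswith("?")


def walk(t, be):
    # Follow the bindings of a variable through the binding environment.
    while is_var(t) and t in be:
        t = be[t]
    return t


def unify(a, b, be):
    """Return a BE extending `be` that makes a and b equal, or None on failure."""
    a, b = walk(a, be), walk(b, be)
    if a == b:
        return be
    if is_var(a):
        return {**be, a: b}
    if is_var(b):
        return {**be, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            be = unify(x, y, be)
            if be is None:
                return None
        return be
    return None


def rename(t, k):
    # Tag every variable of a clause with the invocation number k.
    if is_var(t):
        return f"{t}_{k}"
    if isinstance(t, tuple):
        return tuple(rename(x, k) for x in t)
    return t


def resolve(t, be):
    # Replace bound variables in t by their values.
    t = walk(t, be)
    return tuple(resolve(x, be) for x in t) if isinstance(t, tuple) else t


def solve(goal, program):
    """Yield one resolved instance of `goal` per successful solution path."""
    fresh = count(1)
    work = [([goal], {})]                 # (goalset, BE) pairs still to explore
    while work:
        goalset, be = work.pop()
        if not goalset:                   # goalset empty: a solution is found
            yield resolve(goal, be)
            continue
        g, rest = goalset[0], goalset[1:]            # step 1
        for head, body in program:                   # step 2
            k = next(fresh)
            new_be = unify(g, rename(head, k), be)   # step 3
            if new_be is not None:                   # step 4
                work.append(([rename(b, k) for b in body] + rest, new_be))


# A small grandparent program: three facts and one rule.
program = [
    (("parent", "john", "mary"), []),
    (("parent", "mary", "ann"), []),
    (("parent", "mary", "tom"), []),
    (("grandparent", "?x", "?z"),
     [("parent", "?x", "?y"), ("parent", "?y", "?z")]),
]
```

Calling sorted(solve(("grandparent", "john", "?who"), program)) gives the two instances of the goal with ?who bound to ann and to tom, one per alternative clause for the second subgoal.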

2.4 Parallelism in Logic Programs


In the above-mentioned computational model,
many steps can be performed in parallel. They
are classified [2] as follows:

Search Parallelism: Searching for the candidate
clauses for unification and resolution can be
done in parallel by associative search.

OR-Parallelism: When more than one candidate
clause is present in the program, all of them
can be attempted simultaneously in order to
obtain alternative solutions.

AND-Parallelism: Solving a goal clause reduces
to solving the subgoals in that clause. These
subgoals can be solved simultaneously, the only
constraint being the consistency of the bindings
generated by the subgoals.

Argument Parallelism: The arguments of the two
predicates can be unified in parallel. Parallel
unification requires a consistency check
on the BE, because shared variables
taking part in unification must be bound to
consistent values.


3. THE DATAFLOW ARCHITECTURE 


The motivation to choose the dataflow 
architecture for executing logic programs is the 
inherent computational structure of the problem. 
A goal in a logic program initiates the 
computation, therefore the program is goal- 
driven. Thus if the goal is considered as a 
data item, there is a direct correspondence 
between execution of logic programs and 
execution of dataflow graphs on a dataflow 
machine. 

Our architecture is based on the Manchester
ring [4,5,6], which is shown in figure 1. The
original architecture does not provide any
special hardware for structure handling. Array
data structures are supported by the matching
unit using specialized matching functions [7].
Recently the Manchester machine has been
augmented with a structure store [9] similar to
Arvind's I-Structure store [1]. As the execution
of logic programs requires efficient handling of
structures, we also provide structure memory
(SM) modules in our machine. The structure
memory is functionally the same as Arvind's
I-structure memory. The basic architecture of the
proposed machine is shown in figure 2(a). We
have added one more unit to the machine, which we
call the Definition Search Unit.

A logic program is assumed to be compiled
into a set of definitions. A definition is the
set of clauses having the same head predicate.
The goal predicate identifies its definition and
attempts to unify its arguments with all the
clauses in its definition. This function of
selecting the candidate clauses for unification
is accomplished by the definition search unit,
which is shown in figure 2(b). It has two memory
units, the Definition Search Memory and the
Clause Address Memory. The clause address memory
stores the starting address of the dataflow graph
of each program clause, such that the addresses
of all the clauses in a definition are stored
contiguously. The starting address of each
definition in the clause address memory is
stored in the definition search memory. When a
goal arrives at the definition search unit, it
searches for the address of its definition in
the definition search memory. After getting the
starting address of the definition, a copy of
the goal is sent to all the clauses of that
definition using the clause address memory.
Thus the purpose of the definition search unit is
to initiate computation in all the solution paths
of the goal simultaneously.
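
The two-level lookup performed by the definition search unit might be modelled as in the sketch below. The class and method names are hypothetical, and storing a clause count next to each starting address is an assumption of the sketch: the text only states that the starting address of a definition is kept in the definition search memory.

```python
from dataclasses import dataclass, field


@dataclass
class DefinitionSearchUnit:
    # Definition Search Memory: predicate name -> (start, count) into the
    # clause address memory. The count is an assumption of this sketch.
    definition_search_memory: dict = field(default_factory=dict)
    # Clause Address Memory: starting addresses of the clause dataflow
    # graphs, with the clauses of one definition stored contiguously.
    clause_address_memory: list = field(default_factory=list)

    def load_definition(self, predicate, graph_addresses):
        """Store the graph addresses of one definition contiguously."""
        start = len(self.clause_address_memory)
        self.clause_address_memory.extend(graph_addresses)
        self.definition_search_memory[predicate] = (start, len(graph_addresses))

    def dispatch(self, goal):
        """Return one (graph address, copy of the goal) pair per candidate clause."""
        predicate = goal[0]
        if predicate not in self.definition_search_memory:
            return []  # no definition, hence no solution path
        start, n = self.definition_search_memory[predicate]
        addresses = self.clause_address_memory[start:start + n]
        return [(addr, tuple(goal)) for addr in addresses]


dsu = DefinitionSearchUnit()
dsu.load_definition("p", [100, 140])   # two clauses define p
dsu.load_definition("q", [200, 210])   # two clauses define q
copies = dsu.dispatch(("q", "x"))      # one goal copy per clause of q
```

Each pair returned by dispatch corresponds to one solution path that the machine would start in parallel.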


3.1 Deferred Read Mechanism 


In our architecture, we provide a new way
of handling deferred read requests using the
matching unit. The reason for doing so is
two-fold. First, it simplifies the design of the
memory units. Second, since communication between
the processor and the memory is through a bus,
minimizing control in the memory units
reduces the processing time of the memory unit,
thereby reducing the latency between the
processor and the memory module. Another
advantage is that the hashing mechanism of the
matching unit [4,5,6] can be used to support the
deferred read mechanism without any extra
hardware.

A token in the machine can be represented
as a tuple
<data, c, i, destination, operand type, token type>
The three fields following the data field,
namely c, i and destination, are required
to identify an instruction of a particular
invocation [5], where c and i, called the color
and the iteration count, constitute the tag. The
'operand type' decides whether the token
requires matching. We define another field
called token type, which is necessary for
supporting the deferred read mechanism. The
token types are 'ordinary', 'deferred' and
'signal release'. The 'ordinary' tokens are the
ones generated by the processor during
the normal course of execution of the dataflow
graph. The other two types of tokens, namely
'deferred' and 'signal release', are generated
when a memory read operation at a particular
location occurs before the memory write
operation at that location. Their generation
and use are described below.

A memory location can be in one of
three states: present, absent, or waiting [1].
When a memory read is requested at a particular
location 'l', there are three possibilities. If
the state of the memory location is present, the
read is said to be successful and the result is
routed to the destinations of the read
instruction. If the state is absent or waiting,
the read instruction cannot be satisfied
immediately because the location does not
contain any data value. Such a read request is
deferred [1]. The processor changes the state of
the memory word to waiting and generates a
'deferred' token of the format
<dest-i, c, iter count, l, op-i, "deferred">
where 'dest-i' and 'op-i' of the deferred token
are the destination and operand type of the ith
result token of the read instruction. The
destination field is 'l', which is the address of
the memory location where the read was
attempted; 'c' and 'iter count' are obtained
from the tag of the read instruction. The
number of 'deferred' tokens generated is equal
to the number of destinations of the read
instruction. These 'deferred' tokens wait in the
matching unit for the 'signal release' token,
which is generated by a write instruction at
the memory location 'l'.

When a write instruction into the memory
location 'l' is executed, there are three
possibilities. If the state is present, the
write instruction is invalid [1]. If the state
is absent, the data is written into memory
location 'l' and its state is changed to
present. If the state is waiting, the data is
written into the memory location 'l', its state
is changed to present and a 'signal release'
token of the format

<dl, 0, 0, l, -, "signal release">

is generated by the processor. The data field
contains 'dl', which is the value written in the
memory location 'l', and the destination field
contains 'l', which is the address of the memory
location where the write operation is performed.
This 'signal release' token searches for
partner tokens present in the matching store. A
token 'i' is its partner if the destination
field of 'i' matches the destination field
of the incoming 'signal release' token and the
token 'i' is 'deferred'. The 'signal release'
token extracts from the matching unit all the
'deferred' tokens that match it successfully.
Corresponding to each matched token 'i' of
the matching store, a new token 'k' of the form
<dl, c, iter count, dest-i, op-i, "ordinary">
is generated. This token is the
result of the read instruction at the memory
location 'l'. Its data field contains dl, the
content of the memory location 'l', which is
obtained from the data field of the 'signal
release' token. The destination field of token
'k' contains the destination to which this
'ordinary' token should be routed; it is obtained
from the data field of the matching deferred
token 'i'. The color, iteration count and
operand type fields of token 'k' are copied from
the corresponding fields of the deferred token
'i'. Depending on the operand type, token 'k'
is either put back in the matching unit
distributor or sent forward to the node
store. Thus the 'signal release' token releases
all the deferred read requests generated for the
memory location 'l'. There are two ways of
implementing this scheme. One is to provide a
separate store in the matching unit where only
deferred tokens are stored. This scheme does
not affect the search for ordinary tokens, but
the effective utilization of the matching unit is
reduced. The other scheme does not allocate any
separate memory and is based on the assumption
that the number of deferred tokens is a minor
fraction of the total number of tokens generated
by the program, so that the effect of the
deferred tokens residing in the matching store
on the search for ordinary tokens is insignificant.
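
The read and write cases described above can be traced with a small software model. This is a hedged sketch, not the hardware design: the Machine class, its method names and the operand type labels "unary" and "binary" are hypothetical, and the matching store is modelled as a dictionary keyed by the destination field, which stands in for the hashing mechanism of the matching unit.

```python
from collections import defaultdict, namedtuple

# Token layout as in the text:
# <data, c, i, destination, operand type, token type>
Token = namedtuple(
    "Token", "data color iteration destination operand_type token_type")


class Machine:
    def __init__(self):
        self.state = defaultdict(lambda: "absent")  # absent / waiting / present
        self.value = {}
        self.matching_store = defaultdict(list)     # location -> deferred tokens

    def read(self, loc, color, iteration, dests):
        """dests: (destination, operand type) pairs of the read instruction.
        Returns the 'ordinary' result tokens that are available at once."""
        if self.state[loc] == "present":
            return [Token(self.value[loc], color, iteration, d, op, "ordinary")
                    for d, op in dests]
        self.state[loc] = "waiting"
        for d, op in dests:
            # The data field carries dest-i; the destination field is loc.
            self.matching_store[loc].append(
                Token(d, color, iteration, loc, op, "deferred"))
        return []

    def write(self, loc, data):
        """Returns the 'ordinary' tokens released by this write, if any."""
        if self.state[loc] == "present":
            raise ValueError("write to a location that is already present")
        was_waiting = self.state[loc] == "waiting"
        self.value[loc] = data
        self.state[loc] = "present"
        if not was_waiting:
            return []
        signal = Token(data, 0, 0, loc, None, "signal release")
        return self.release(signal)

    def release(self, signal):
        # Extract every deferred partner whose destination field matches.
        partners = self.matching_store.pop(signal.destination, [])
        return [Token(signal.data, t.color, t.iteration, t.data,
                      t.operand_type, "ordinary")
                for t in partners]


m = Machine()
early = m.read(7, 1, 0, [(30, "unary"), (31, "binary")])  # read before write
released = m.write(7, 42)                                 # releases both reads
late = m.read(7, 2, 0, [(40, "unary")])                   # location is present
```

In this trace the early read returns nothing and leaves two 'deferred' tokens behind, the write releases them as two 'ordinary' tokens carrying the value 42, and the late read succeeds immediately.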


4. DATA STRUCTURES AND DATAFLOW PROCEDURES 


To execute logic programs, the machine has
to support all the basic data types that are
used in logic programs. These are constants,
variables, lists and structures. In fact, list
and constant types are special cases of
structures but are treated separately in
PROLOG [3]. We follow the same convention.
We represent the BE as a list of <variable,
binding value> pairs. Apart from these data
types, some structures are necessary for
representing a goal, a context and a binding
environment. These are the goal node, the context
and the binding node respectively. They are
described as follows.

Goal Node: It is a 2-tuple <<predicate name,
argument pointer>, context pointer>. The
predicate name is used to identify the
corresponding definition of the predicate in the
definition search unit. The argument pointer
points to the array containing the arguments of
the goal. The context pointer points to the
last context created.

Binding Environment: It is a list of binding 
nodes where each binding node is represented as 
a 3-tuple <variable,binding value,next node> 
where variable and binding value fields are used 
to represent the bindings created during 
unification. Next node is used to form the list 
of bindings. 

Context: It is a record-like data structure used
to hold the BE along with other control
information. The following are the fields in a
context.

Context number: Each context is identified by a
unique number. This number is used in renaming
the variables of a clause in order to
distinguish the variables of the clause under
different invocations.

Tag: It is a <color, iteration count> pair which
is used in restoring the tag of a data token
when it returns from a clause. At the time of
the creation of a context, the return tag of the
token is stored in the 'tag' field.

Destination: It gives the address in the
dataflow graph to which the result token should
return after exiting from the clause.

Prev context: It is a pointer to the previous
context.

First: It is a pointer to the first element in
the BE.

Last: It points to the last element in the BE.

Unify fail: A boolean variable that indicates
the status of the environment, which can be valid
or invalid.


4.1 Handling Multiple Binding Environments 


Jim Crammond [8] has discussed three methods
for handling multiple BEs for OR-parallel
execution of logic programs. These are
directory tree, hash windows and variable
importation. A modified form of the directory
tree method, which is suitable for dataflow
implementation, is employed in our architecture.
The advantage of our scheme is that no
environment copying is done. Since we represent
the environment as a list of bindings, the time
required to search for the value of a variable
is O(n), where n is the number of elements in the
BE.

When a clause is invoked, a new context,
referred to as the present context, is
allocated in the structure memory. The return
address and the return tag are stored in the
present context. The arguments of the clause
head are unified with those of the goal and the
BE is created. The prev_context field of the
present context is made to point to the previous
context. The pointer to the previous context is
obtained from the context pointer field of the
goal node. When the clause does not have any
more goals to be solved, the BE of the previous
context is appended to the BE created in the
present context and the new BE so created is
stored in a new context called the return
context, whose other fields are copied from the
corresponding fields of the previous context.
The return context is returned from the clause
to the destination indicated in the destination
field of the present context. The two distinct
advantages of doing so are:

1. Search for a variable need be done in only
one context. In the directory tree
technique, by contrast, the search has to be done
in a list of contexts.

2. No copying of the environment is required at
any stage, whereas in the directory tree method
the uncommitted contexts, which contain at least
one unbound variable, have to be copied and
passed on to the subsequent contexts.

We illustrate by a hypothetical example how
environments are managed. Consider a program
having the six clauses given below.

(1) P:-Q,R. (2) P:-A. (3) Q. (4) Q. (5) A. (6) R.
The goal is P. We assume that each predicate
has some arguments. The goal P unifies with
clauses 1 and 2 to produce two contexts c1 and
c2 as shown in figure 3(a). The BE created in
context i is represented as '(be i)'. Clauses
1 and 2 form the two alternative
solution paths for the goal P. The subsequent
goals in the two solution paths are Q and A.
The goal Q unifies with clauses 3 and 4 to
generate two more contexts c3 and c4. Both these
contexts have c1 as their predecessor. The goal
A unifies with clause 5 and creates context c5.
Its predecessor is c2. The configuration of the
contexts at this stage is shown in figure 3(b).
Clauses 3, 4 and 5, being unit clauses, do
not have any subgoals in their body; therefore
the 'return' contexts r1, r2 and r3, shown in
figure 3(c), are returned to the destinations
specified in contexts c3, c4 and c5
respectively. The contexts r1 and r2 from
clauses 3 and 4 return to clause 1 and form two
new goals with the predicate R. The context r3
from clause 5 returns to clause 2. As clause 2
does not have any more goals, the BE in that
context (r3) is returned as one of the three
solutions, as indicated in figure 3(c). The two
goals R corresponding to the two 'return'
contexts r1 and r2 unify with clause 6 and
create contexts c6 and c7 as shown in figure
3(d). Subsequently 'return' contexts r4 and r5
are returned from clause 6 to give the other two
solutions, as shown in figure 3(e). Looking at it
another way, we find that during the
execution of a goal, a tree of BEs is created,
with each path corresponding to one alternative
solution. Whenever a goal enters a clause, it
increases the level of the tree grown so far by
one. Whenever an exit is made from a clause,
the level of the tree is reduced by one.
Ultimately, when all the goals are solved, the
tree is converted into a set of BEs, each of
which is an alternative solution to the goal.
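
The example can be replayed with a few lines of Python. The sketch below is illustrative only: the Node and Context classes keep just the fields needed here, each BE is reduced to a single binding node labelled be1 to be7 (the example leaves the actual arguments unspecified), and exit_clause follows the append step described above, linking the BE of the previous context behind the BE of the present context without copying any binding node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    # Binding node <variable, binding value, next node>.
    variable: str
    value: str
    next_node: Optional["Node"] = None


@dataclass(eq=False)
class Context:
    number: int
    prev_context: Optional["Context"]
    first: Optional[Node] = None
    last: Optional[Node] = None


def new_context(number, prev, label):
    # Hypothetical helper: a context whose BE is one placeholder binding.
    node = Node(label, "...")
    return Context(number, prev, node, node)


def exit_clause(ct):
    """Return context: the BE of ct followed by the BE of ct.prev_context."""
    prev = ct.prev_context
    if prev is None:
        return ct
    ret = Context(prev.number, prev.prev_context)
    if ct.first is None:
        ret.first, ret.last = prev.first, prev.last
    else:
        ret.first = ct.first
        if prev.first is None:
            ret.last = ct.last
        else:
            ret.last = prev.last
            ct.last.next_node = prev.first  # link the lists, copy nothing
    return ret


def be_of(ct):
    # List the binding labels of a context's BE in order.
    out, node = [], ct.first
    while node is not None:
        out.append(node.variable)
        node = node.next_node
    return out


c1 = new_context(1, None, "be1")   # goal P unified with clause 1
c2 = new_context(2, None, "be2")   # goal P unified with clause 2
c3 = new_context(3, c1, "be3")     # goal Q unified with clause 3
c4 = new_context(4, c1, "be4")     # goal Q unified with clause 4
c5 = new_context(5, c2, "be5")     # goal A unified with clause 5
r1, r2, r3 = exit_clause(c3), exit_clause(c4), exit_clause(c5)
c6 = new_context(6, r1, "be6")     # goal R (from r1) unified with clause 6
c7 = new_context(7, r2, "be7")     # goal R (from r2) unified with clause 6
r4, r5 = exit_clause(c6), exit_clause(c7)
solutions = [be_of(r3), be_of(r4), be_of(r5)]
```

The three lists in solutions are (be5)(be2), (be6)(be3)(be1) and (be7)(be4)(be1). The node for be1 is shared by the last two, which is why no environment has to be copied.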


4.2 Dataflow Procedures 


As the user is relieved of the burden of
specifying the algorithm, the machine must have
some built-in mechanisms for controlling the
execution of programs. In the case of
sequential PROLOG, unification and backtracking
are built into the machine [3]. Analogously, we
provide the following built-in dataflow
procedures for the execution of logic programs:
(i) UNIFY (ii) WRITE IN BE (iii) SEARCH (iv) COPY
ARGUMENTS (v) EXIT (vi) CALL COPY AND UNIFY
(vii) PASS ARGUMENTS. We describe these
procedures below in a Pascal-like syntax and give
explanations wherever necessary.


procedure UNIFY(t1,t2,ct); /* t1 and t2 are the
two terms to be unified and ct is the context */
begin
  if not ct.unify_fail then
  begin
    if t1.dtype=t2.dtype then
    begin
      if t1.dtype='atom' then
      begin
        if t1.val<>t2.val then
          ct.unify_fail:=true;
      end;
      if t1.dtype='var' then
        WRITE IN BE(t1,t2,ct);
      if t1.dtype='list' then
      begin
        UNIFY(t1.head,t2.head,ct);
        UNIFY(t1.tail,t2.tail,ct);
      end;
      if t1.dtype='struct' then
      begin
        if different struct names then
          ct.unify_fail:=true
        else
        begin
          for i:=1 to n do
            /* n = no. of arguments */
            UNIFY(t1.arg(i),t2.arg(i),ct);
          synchronize;
        end;
      end;
    end
    else if t1.dtype='var' then
      WRITE IN BE(t1,t2,ct)
    else if t2.dtype='var' then
      WRITE IN BE(t2,t1,ct)
    else ct.unify_fail:=true;
  end;
  return ct;
end.
The procedure UNIFY unifies the two terms and
calls the procedure WRITE IN BE to write the
<variable, value> pair in the BE.


procedure WRITE IN BE(v1,d1,ct); /* v1 : variable,
d1 : binding value, ct : context */
begin
  if not ct.unify_fail then
  begin
    bn:=request new binding node;
    bn.variable:=v1;
    bn.binding_value:=d1;
    search for v1 in BE;
    if v1 is present then
    begin
      d1':=value of v1 already present in BE;
      UNIFY(d1',d1,ct);
    end
    else put bn in the BE;
  end;
  return ct;
end.


The procedure WRITE IN BE checks for the
presence of a variable in the BE. If the
variable is already present, it performs the
consistency check on the BE by unifying the two
values of the variable: the one already present
and the new binding value d1. If the variable is
not present, the <variable, value> pair is
written in the BE.


procedure SEARCH(a1,d1,ct); /* a1 : the address
of an empty memory location, d1 : term and ct :
the context */
begin
  step1 : if d1 is ground or env=nil then
    write d1 in address a1;
  if d1 is a variable then
  begin
    search for d1 in BE;
    if not found then write d1 in a1
    else
    begin
      new d1:=binding value of d1;
      new ct:=ct;
      new a1:=a1;
      goto step1 /* with the new d1, ct and a1 */
    end;
  end;
  if d1 is a list then
  begin
    a1':=request new list node;
    write a1' in a1;
    SEARCH(a1'.head,d1.head,ct);
    SEARCH(a1'.tail,d1.tail,ct);
  end;
  if d1 is a structure then
  begin
    f1':=request new structure;
    store functor name and number of arguments;
    for i:=1 to number of arguments do
      SEARCH(f1'.arg(i),d1.arg(i),ct);
  end;
end;
The procedure searches the environment for the
given term and writes the result of the search
process in address a1.


procedure COPY ARGUMENTS(a1,d1,i); /* a1 :
address of a memory location, d1 : term and i :
integer */
begin
  if d1 is ground then write d1 in a1;
  if d1 is a variable then
  begin
    d1':=rename(d1,i);
    write d1' in a1;
  end;
  if d1 is a list then
  begin
    d1':=request new list node;
    COPY ARGUMENTS(d1'.head,d1.head,i);
    COPY ARGUMENTS(d1'.tail,d1.tail,i);
  end;
  if d1 is a structure then
  begin
    f1:=request new structure;
    store functor name and arguments;
    for j:=1 to number of arguments do
      COPY ARGUMENTS(f1.arg(j),d1.arg(j),i);
  end;
end.
The procedure COPY ARGUMENTS writes the
arguments into the address location specified.
This procedure is required because each
unification is performed on a new copy of the
arguments. The instruction 'rename' renames a
variable by tagging it with a number. The
renaming is required to distinguish the
variables of a clause under different
invocations.


procedure EXIT(ct);
/* ct is a context pointer */
begin
  if ct.prev_context=nil then
  begin
    return ct to the destination specified in its
    destination field and set the tag of the return
    token with ct.tag;
  end
  else
  begin
    ct':=request new context;
    copy the four fields context number, tag,
    destination and prev_context of ct' from the
    corresponding four fields of ct.prev_context;
    return ct' to the address specified in the
    destination field of ct and set its tag (color
    and iteration count) using the tag field of ct;
    if ct.first=nil then
    begin
      ct'.first:=ct.prev_context.first;
      ct'.last:=ct.prev_context.last;
    end
    else
    begin
      ct'.first:=ct.first;
      if ct.prev_context.first=nil then
        ct'.last:=ct.last
      else
      begin
        ct'.last:=ct.prev_context.last;
        ct.last.next_node:=ct.prev_context.first;
      end;
    end;
  end;
end.

The procedure EXIT appends the BE of the
previous context (ct.prev_context) to the BE of
the present context and creates a return context
ct' if the previous context is not nil;
otherwise it returns ct. The return address and
return tag are obtained from the present
context (ct).


procedure CALL COPY AND UNIFY(g1,ct,n,addr,r);
/* g1 : goal node, ct : the context, n : number of
arguments in the head of a clause, addr : the
address of the array of arguments of the head, and
r : return address */
begin
  a1:=request new array(n);
  a2:=request new array(n);
  k:=ct.context_number;
  for i:=1 to n do
  begin
    COPY ARGUMENTS(a1(i),addr.arg(i),k);
    COPY ARGUMENTS(a2(i),g1.arg_addr.arg(i),0);
    /* 0 indicates that renaming is not done */
    UNIFY(a1(i),a2(i),ct)
  end;
  synchronize;
  return ct;
end;
The above procedure invokes unification of all
the arguments of a clause head with those of the
goal. The instruction 'synchronize' is used to
wait till all the unifications of the arguments
are complete, which ensures the complete
formation of the BE.


procedure PASS ARGUMENTS(a1,d1,n,ct); /* a1,
d1 : pointers to arrays, n : integer and ct :
context */
begin
  for i:=1 to n do
    SEARCH(a1(i),d1(i),ct);
end;
The above procedure carries out the search for
each argument of d1 in the BE and writes the
result of the search process in the
corresponding element of array a1.

A clause is represented as a dataflow graph 
consisting of calls to procedures CALL COPY AND 
UNIFY, PASS ARGUMENTS and EXIT. The dataflow
graphs for a definite clause and a unit clause are
shown in figures 4(a) and 4(b) respectively.
The operators 'split goal' and 'form goal' are
used for manipulating and creating the goal node.
The main feature of the dataflow graphs of the 
clauses is their modularity. The complexity of 
the graphs is independent of the argument 
complexity which enables easy compilation of the 
program clauses. 


5. SIMULATION OF THE ARCHITECTURE


A simulator for this architecture has been
developed in SIMULA-67 on a DEC-1090 system with
a view to studying the performance of the
machine in terms of speedup and the utilization
of the various hardware units, namely the
processor, matching unit, node store unit
and memory modules. The parameters varied are the
number of processors, the number of matching
units and the number of memory modules. We have
simulated the Grandparent relationship problem on
our machine. The problem, though simple, captures
all the features of a logic program. The
relation between execution time and number of
processors for this problem is shown in figure
5(a). We find that for this particular problem,
the maximum speedup achievable is 3 with eight
processors. The variation of the utilization of
the matching unit and the processing element with
the number of processing elements is shown in
figure 5(b). The utilization of the matching
unit is low (26%) for a single processor and
reaches a maximum (67%) with eight processors.
Further simulations of more complex problems are
in progress.


6. CONCLUSIONS 


A dataflow machine for executing logic
programs is proposed. The machine supports
OR-parallelism and argument parallelism. A new
scheme for handling the deferred read mechanism
is suggested. The dataflow graphs for the program
clauses are modular and independent of
the complexity of the arguments; hence the
compilation of the clauses is easy. Work is in
progress to devise better schemes for
representing the BE with a view to minimizing
the search process.


ACKNOWLEDGEMENTS 


The authors thank Mr S. Sundaram, Centre 
for Computer Aided Design, Indian Institute of 
Science, Bangalore for helping them in the 
preparation of the manuscript. 


REFERENCES 


[1] Arvind and R.E.Thomas, "I-Structures: An
Efficient Datatype for Functional Languages",
Technical Memo TM-CSG-174, Laboratory for
Computer Science, MIT, September 1980.

[2] J.S.Conery and D.Kibler, "Parallel
Interpretation of Logic Programming", Proceedings
of the International Conference on Functional
Programming and Computer Architecture, 1981.

[3] D.H.D.Warren, "Implementing Prolog - Compiling
Predicate Logic Programs", D.A.I. Research
Report 39, University of Edinburgh, 1977.

[4] I.Watson and J.R.Gurd, "A Practical Dataflow
Computer", IEEE Computer, February 1982.

[5] J.R.Gurd, I.Watson and J.R.W.Glauert, "A
Multilayered Dataflow Architecture", Internal
Report, Department of Computer Science,
University of Manchester, 1980.

[6] J.R.Gurd, C.C.Kirkham and I.Watson, "The
Manchester Prototype Dataflow Computer", CACM,
Vol. 28, No. 1, January 1985.

[7] J.Sargeant and C.C.Kirkham, "Stored Data
Structures on the Manchester Dataflow Machine",
Proceedings of the 13th Annual Symposium on
Computer Architecture, 1986.

[8] Jim Crammond, "A Comparative Study of
Unification Algorithms for OR-Parallel Execution
of Logic Languages", IEEE Transactions on
Computers, Vol. C-34, No. 10, 1985.

[9] K.Kawakami and J.R.Gurd, "A Scalable Dataflow
Structure Store", Proceedings of the 13th
Annual Symposium on Computer Architecture, 1986.

[10] N.Ito et al., "The Architecture and
Preliminary Evaluation Results of the Experimental
Parallel Inference Machine PIM-D", Proceedings
of the 13th Annual Symposium on Computer
Architecture, 1986.

[11] N.Ito et al., "Dataflow Based Execution
Mechanisms of Parallel and Concurrent Prolog",
New Generation Computing 3 (1985).

[12] R.Hasegawa and Makoto Amamiya, "Parallel
Execution of Logic Programs Based on Dataflow
Concept", Proceedings of the International
Conference on Fifth Generation Computer Systems,
1984.

[13] R.Hasegawa et al., "An Architecture for
List-Processing-Oriented Dataflow Machine",
Review of the Electrical Communication
Laboratories, Vol. 32, No. 5, 1984.

[14] S.Umeyama and K.Tamura, "A Parallel
Execution Model of Logic Programs", Proceedings
of the 10th Annual Symposium on Computer
Architecture, 1983.

[15] Zahran Halim, "A Data-Driven Machine for
OR-Parallel Evaluation of Logic Programs", New
Generation Computing 4 (1986).


FIG. 1 THE MANCHESTER RING
[Block diagram. Labeled blocks: node store, arbitrator, matching unit, distributor, processing elements (PE) and switch, with links from and to the host.]


FIG. 2(a) DATAFLOW ARCHITECTURE TO EXECUTE LOGIC PROGRAMS
[Block diagram including a definition search unit, with links to and from the host. Legend: PE - processing element, MU - matching unit, SM - structure memory.]


FIG. 2(b) DEFINITION SEARCH UNIT
[Block diagram. A definition search memory and a clause address memory holding the dataflow graph addresses of the clauses in a definition.]


FIG. 3 MULTIPLE BINDING ENVIRONMENTS IN OR-PARALLEL EXECUTION OF LOGIC PROGRAMS
[Panels: (a) unification with clauses 1 and 2; (b) unification with clauses 3, 4 and 5; (c) return from clauses 3, 4 and 5; (d) unification with clause 6; (e) return from clause 6.]


FIG. 4(a) DATAFLOW GRAPH FOR A CLAUSE P(X,Y) :- Q(X,Z), R(Z,Y)

FIG. 4(b) DATAFLOW GRAPH FOR A CLAUSE P(X,Y)
[Legend: n - number of arguments; arg address - pointer to argument array; ret address - return address.]


FIG. 5 (a) EXECUTION TIME vs. NUMBER OF PROCESSORS
       (b) UTILIZATION vs. NUMBER OF PROCESSORS
[Plots for the grandparent relationship program; 1 time unit = 10 nanosecs; execution time axis in thousands.]

    gpar(X,Y) :- par(X,Z), par(Z,Y)
    par(john, mary)
    par(mary, jill)
    <- gpar(john, Z)


Storage Schemes for Efficient Computation of a Radix 2 FFT in a
Machine with Parallel Memories


D. T. Harper II and D. A. Linebarger 


Erik Jonsson School of Engineering and Computer Science 
The University of Texas at Dallas 
Richardson, Texas 75083-0688 
(214) 690-2974 


Abstract


Efficient bit-reversed access of vectors is an important considera- 
tion in designing architectures for use in signal processing applications. 
In particular, this type of access occurs in radix 2 FFT algorithms. 


In this paper two skewing schemes which permit efficient bit- 
reversed access are discussed and compared in the context of a simple 
computer architecture such as might be designed around a low-cost, 
high performance specialized DSP chip. Performance measurements 
are shown for each scheme and simple address generation hardware is 
presented. 


Introduction 


Efficient algorithms for computation of the discrete Fourier 
transform have been studied intensely over the past 25 years. Most of 
these investigations centered around attempts to minimize the arith- 
metic complexity of such algorithms. The motivation for this direction 
was the fact that the time consuming operations in the algorithms were 
the arithmetic ones. However, improvements in VLSI processing and
architectural optimization now permit extremely fast arithmetic
operations to be performed. Current commercial multipliers are capable of
sub-50 ns operation on 32-bit data and all indications are that this figure
will continue to fall. One effect of these advances is that the
processing bottleneck has moved out of the arithmetic unit and into the
memory. The limiting factor is now the rate at which data can be
transferred to the arithmetic unit.


There are several approaches to alleviating this bottleneck. The 
first approach is to use faster memory. The same technological 
advances that achieved fast arithmetic circuits also achieved fast 
memory circuits. The disadvantage of this approach is that for large 
amounts of memory the cost becomes prohibitive. Other concerns are
the problems of dissipating the large quantities of heat generated by the 
fast memory devices and the additional space required by the lower bit 
densities of the fast memories. 


The second approach is to use a cache to achieve an apparent 
memory cycle time which is lower than the cycle time of the main 
memory. Disadvantages of this approach are the added complexity of 
hardware to implement the cache and the added complexity of the 
software or hardware to manage the cache. 


A third approach, and the one pursued here, is to use parallel
banks of slower memory, each of which can operate independently, so 
that overlap between multiple banks, or modules, creates an effective 
memory cycle time which is low enough to support the data rates 
required by the arithmetic unit. This technique is particularly applica- 
ble to systems which perform vector operations. The disadvantage of 
using parallel memory architectures is that severe performance degra- 
dation results if references are directed to modules which are busy pro- 
cessing prior references. This event is known as a memory collision. 
The rate at which collisions occur is determined by three factors: which 
data items are being referenced, the temporal order in which these 
items are accessed, and how the items are distributed over the parallel 
modules. Since the first and second factors are usually determined by 
the particular problem being solved it is useful to focus on the third 
factor.


The issues of determining a storage policy to allow efficient 
access to vectors using strides common to matrix operations and of 
how much performance degradation occurs when collisions do occur 
have been considered by several authors [1,2,3,4,5,6] with most 
conflict-free systems being based on a prime number of modules. 
Harper and Jump [7] have also considered the performance
implications of using a composite number of modules. Norton and Melton [8]
have proposed a storage scheme to solve the problem of vector
accesses with strides equal to powers of 2 ($S = 2^s$) for memory systems
with a power of 2 number of modules ($N = 2^n$).


In this paper accessing patterns common to a typical Cooley- 
Tukey radix 2 FFT algorithm are considered. The Cooley-Tukey radix 
2 FFT operates on a vector with a length equal to a power of 2 and 
accesses the input data with strides equal to powers of 2. The perfor- 
mance of the proposed storage schemes with the FFT access patterns is 
discussed in the context of a simple architecture which does not have 
an expensive parallel interconnection network such as a multistage net- 
work or a crossbar. 


Architecture 


In the architectural model considered in this paper, computation 
is performed by a processor which is assumed to be capable of process- 
ing data at the bus rate. Equivalently, the system bottleneck is 
assumed to be caused by access conflicts in the memory. This is not an 
unreasonable assumption given the speed of current hardware and the 
prevalence of pipelining in execution units. This architecture differs 
sharply from the system considered in [8] which employed a highly 
parallel interconnection network. 


One important feature required of the processor is a decoupling 
of the data fetch and execute cycles. This allows the fetch hardware to 
generate a stream of addresses to the memory independently of the 
operation of the execution unit. After some delay, the data referenced 
by the address stream is returned to the execute unit of the processor. 
These two tasks operate asynchronously with each other. The data- 
independent nature of vector accesses in the FFT algorithm makes this 
decoupling advantageous. Each memory module is assumed to have a
bus interface register so that a reference transmitted to a memory does
not require the use of the bus for the entire memory cycle (subsequent
transmissions can be overlapped with the memory cycle).


Figure 1: Butterfly


FFT Algorithms 


The fast Fourier transform (FFT) is a computationally efficient 
way of computing the discrete Fourier transform - an operation that is 
often performed in signal-processing applications. The focus of this 
paper is on implementation of the standard radix 2 Cooley-Tukey FFT 
[9,10] on a machine with parallel memories. Although FFTs have
been developed that are faster than the radix 2 FFT, the radix 2 FFT is
still widely used. It is assumed that the input sequence is of length
$L = 2^l$.

The fundamental operation in any FFT is known as a butterfly. A
radix 2 butterfly has two inputs and two outputs. The butterfly
computation consists of one complex multiplication and two complex
additions as shown in Figure 1. Each node represents a complex addition
with any indicated negations. $W_L^n$ is notation often used with FFTs
and is a symbol for $\exp(-j 2\pi n / L)$. $W_L^n$ is a multiplicative constant for
the branch it appears by. Each stage of a radix 2 FFT consists of L/2
butterfly computations with the entire computation requiring $\log_2 L$
stages. The entire FFT is a sequence of butterfly computations. Each
butterfly has two inputs and two outputs, but for the standard
implementation of the Cooley-Tukey radix 2 FFT, the separation between the
two inputs (and the two outputs) varies from one stage to the next.
Hence, the accessing pattern is variable.
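
As a hedged illustration of the butterfly described above, the following Python sketch computes one radix 2 butterfly. It assumes the decimation-in-frequency form, in which the twiddle factor multiplies the difference of the two inputs; the paper's Figure 1 may place the multiplication differently. The names `twiddle` and `butterfly` are illustrative and do not come from the paper.

```python
import cmath


def twiddle(n, L):
    """Return W_L^n = exp(-j * 2 * pi * n / L)."""
    return cmath.exp(-2j * cmath.pi * n / L)


def butterfly(a, b, w):
    """One radix 2 butterfly: two complex additions, one complex multiplication.

    Assumed decimation-in-frequency form: the difference is scaled by w.
    """
    return a + b, (a - b) * w
```

For a length 2 transform the single butterfly with `twiddle(0, 2)` already yields the full DFT of the two inputs.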


The most commonly used algorithms for the radix 2 FFT are in- 
place. The in-place algorithms have their input vectors in order and 
their output vector is produced in a scrambled order, or vice-versa. Fig- 
ure 2a shows an algorithm where the input is in order. There are two 
important observations to be made concerning Figure 2a: 


(1) The output points for each butterfly are adjacent to its input
    points. This implies that the implementation can be calculated
    in-place.

(2) The separation between the input (and output) points for each
    butterfly decreases in each successive stage of the FFT. This
    implies that the accessing pattern is different from one stage of
    the FFT to the next.


The second characteristic of the in-place implementation of the radix 2 
FFT makes it difficult to develop an efficient algorithm for storing the 
input data across multiple memories. 


The nodes at a given stage of the FFT can be reordered without 
changing the function of the FFT as long as none of the connections are 
changed and the multiplicative constants move with the original con- 
nection with which they are associated. Thus a rearrangement of the 
node ordering for the flow graph in Figure 2a can be considered which 
might be more efficient for an architecture with parallel memories. The 
implementation illustrated in Figure 2b is known as a constant
geometry algorithm since all stages of the FFT have the same
connection pattern. The same computations are performed by the
FFTs illustrated in Figures 2a and 2b; only the order of computation
has changed. The input and output vectors of the constant geometry
FFT are in the same order as those for the in-place FFT; only the
intermediate nodes have been reordered. However, it should be noted that
the constant geometry implementation of the FFT cannot be calculated
in-place.


Figure 2: FFT flow graphs


Since each stage of the constant geometry FFT is identical the 
number of accessing patterns to be considered is reduced. The butterfly 
inputs and outputs are accessed in pairs; on the output the pairs consist 
of consecutive, adjacent pairs (stride 1). The input pairs consist of ele- 
ments separated by half the sequence length, L. At the last stage, the 
output appears in scrambled order. 
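
The access pattern just described can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: it uses a decimation-in-frequency formulation, reads each input pair at positions i and i + L/2, writes the results to the adjacent positions 2i and 2i + 1, and uses a second buffer because the algorithm is not in-place. The twiddle exponent `(i >> s) << s` is what this particular formulation requires; all function names are hypothetical.

```python
import cmath


def constant_geometry_fft(x):
    """Radix 2 constant geometry FFT; the result is in bit-reversed order."""
    L = len(x)
    stages = L.bit_length() - 1
    assert L == 1 << stages, "length must be a power of 2"
    src = list(x)
    for s in range(stages):
        dst = [0j] * L                      # second buffer: not in-place
        for i in range(L // 2):
            a, b = src[i], src[i + L // 2]  # inputs separated by L/2
            e = (i >> s) << s               # twiddle exponent at stage s
            w = cmath.exp(-2j * cmath.pi * e / L)
            dst[2 * i] = a + b              # outputs are adjacent (stride 1)
            dst[2 * i + 1] = (a - b) * w
        src = dst
    return src


def bit_reverse(k, bits):
    """Reverse the low `bits` bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def dft(x):
    """Direct O(L^2) discrete Fourier transform, used only as a reference."""
    L = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / L) for n in range(L))
            for k in range(L)]
```

Element q of the returned list holds the DFT coefficient with index `bit_reverse(q, stages)`, which is the scrambled output order mentioned above.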


Storage Schemes 


Vector accesses in the constant geometry FFT algorithm [9] con- 
sist of three different patterns. One is the stride 1 access which is 
easily handled by interleaving the addresses across the modules. The 
second is consecutive pairs separated by half the sequence length. The 
third access pattern is the pattern referred to in the previous section as
"scrambled". The pattern is not truly random; it is a "bit-reversed"
pattern. In this pattern the sequence of elements required is equivalent to
the binary numbers formed by reversing the order of the bits of a 
binary counter. The most significant bit of the counter becomes the 
least significant bit of the address, the least significant bit of the counter 
becomes the most significant bit of the address, etc. Figure 3 shows the 
order of elements required by a bit-reversed access of a length 16 vec- 
tor. Unlike constant stride accessing patterns, bit-reversed accesses are 
dependent on the length of the vector being accessed. In-place FFT 
algorithms require bit-reverse access for vector lengths of all powers of 
2 less than or equal to L. The constant geometry implementation only 
requires bit-reverse access for length L and only performs this type of 
access after the final stage of computation is completed. 


Stride 1:       0000   0001   0010   0011   0100   ...
                (0)    (1)    (2)    (3)    (4)

Bit-Reversed:   0000   1000   0100   1100   0010   ...
                (0)    (8)    (4)    (12)   (2)


Figure 3: Stride 1 and Bit-Reversed Access Patterns
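
The bit-reversed sequence of Figure 3 can be generated with a few lines of Python. This is only a sketch of the counter-reversal idea described in the text; the function names are illustrative.

```python
def bit_reverse(k, bits):
    """Reverse the low `bits` bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def bit_reversed_order(L):
    """Addresses visited by a bit-reversed access of a length L vector."""
    bits = L.bit_length() - 1
    assert L == 1 << bits, "length must be a power of 2"
    return [bit_reverse(k, bits) for k in range(L)]
```

Calling `bit_reversed_order(16)` yields a sequence that begins 0, 8, 4, 12, 2, matching the figure. Because the reversal width depends on L, the same counter value maps to different addresses for different vector lengths.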


It has been recognized by several authors [11] that generating 
bit-reversed sequences of addresses under software control is prohibi- 
tively expensive. To reduce the penalty for these accesses it has been 
proposed that hardware support for bit-reversed address generation be 
added to the memory system [12]. While this improves performance
by removing the address generating task from the software, only part of 
the problem is solved. The problem of memory contention in bit- 
reversed accesses has not been considered. 


To address the problem of memory collisions in bit-reversed 
accesses the vector storage scheme of the system must be considered. 
It has been noted that interleaved schemes based on low-order inter- 
leaving perform well only when the access stride is relatively prime to 
the number of memory modules. Referring to the bit-reversed
sequence in Figure 3, it seems as though there is no fixed stride
involved. However, the sequence can be viewed as the concatenation
of L/2 length-2 vector accesses, each of which has a stride of L/2 but
a different starting address. Thus, there are two accessing patterns
involving pairs separated by L/2, the difference being the order in which the
pairs are accessed. Since the number of modules, $N = 2^n$, is a power of
2, and since the length of vectors in FFT algorithms is often also a
power of 2, $L = 2^l$, system performance can be degraded due to the
effects of memory collisions when low-order interleaving is used.
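
A small Python sketch makes the collision problem with low-order interleaving concrete. It assumes the usual definition, in which address a is stored in module a mod N, and uses the standard result that a constant stride S then touches N / gcd(S, N) distinct modules. The function names are illustrative.

```python
from math import gcd


def low_order_module(address, N):
    """Module holding `address` under low-order interleaving."""
    return address % N


def modules_touched(start, stride, count, N):
    """Set of modules referenced by a constant-stride vector access."""
    return {low_order_module(start + i * stride, N) for i in range(count)}


def distinct_module_count(stride, N):
    """Number of distinct modules a long constant-stride access can reach."""
    return N // gcd(stride, N)
```

With N = 4 and L = 16, both elements of every stride L/2 = 8 pair fall in the same module, so each pair collides, whereas a stride 1 access spreads over all four modules.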


A more desirable storage scheme therefore must provide for both
stride 1 and stride L/2 accesses. If L is restricted to be a power of 2
then Norton and Melton have proposed such a scheme based on a set of
boolean transformations. For a system with N = 4 and L = 16
their storage scheme maps the elements of the vector into the modules
as shown in Figure 4.


    M0   M1   M2   M3
     0    3    2    1
     5    6    7    4
    10    9    8   11
    15   12   13   14


Figure 4: Norton/Melton Storage Scheme


Under this scheme all power of 2 stride accesses can be made in a 
conflict-free manner. Although they do not explicitly state the ability 
of their scheme to provide conflict-free access to bit-reversed patterns 
it is clear that the capability is present. These statements are true as 
long as the architecture has a parallel interconnection network between 
the processor and the memory so that an address can be delivered to 
each module simultaneously. In the architecture considered here that is 
not the situation. Since only a single address is delivered to the 
memory in each time period contention can occur due to references to 
the same module in a sliding window of N references rather than in a 
fixed window of N references. Figure 5 uses the sequence of module 
addresses generated for a stride 1 access in a 4 memory system with the 
Norton/Melton storage policy to show collisions due to the lack of a 
parallel interconnection network. Unfortunately, the scheme proposed 
by Norton and Melton produces multiple conflicts when used with a 
sequential network. An alternative scheme for storing vectors accessed
in bit-reversed order can be constructed as follows. Note that this 
scheme does not eliminate conflicts but serves to reduce the frequency 
of their occurrence compared to the scheme of Norton and Melton. 
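
The difference between a fixed window and a sliding window of N references can be checked with a short Python sketch. It is a static count under simplifying assumptions: it ignores the extra delay that each conflict would itself introduce, and the function names are hypothetical. The module sequence is the stride 1 sequence for the Norton/Melton scheme with N = 4, read from Figure 4.

```python
def parallel_conflicts(modules, N):
    """Count fixed groups of N references in which a module repeats."""
    groups = [modules[i:i + N] for i in range(0, len(modules), N)]
    return sum(1 for g in groups if len(set(g)) < len(g))


def serial_conflicts(modules, N):
    """Count references that follow the previous reference to the same
    module with fewer than N - 1 other references in between."""
    last_seen = {}
    conflicts = 0
    for position, m in enumerate(modules):
        if m in last_seen and position - last_seen[m] < N:
            conflicts += 1
        last_seen[m] = position
    return conflicts


# Stride 1 module sequence under the Norton/Melton scheme, N = 4.
NORTON_MELTON_STRIDE_1 = [0, 3, 2, 1, 3, 0, 1, 2, 2, 1, 0, 3, 1, 2, 3, 0]
```

Under this count the sequence is conflict-free for a parallel network but produces several conflicts when the references arrive one at a time.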


Begin by dividing the vector into N parts each of length L/N.
Elements 0, 1, ..., L/N - 1 are placed in module 0. Elements
L/N, L/N + 1, ..., 2L/N - 1 are placed in module 1, etc. This storage
scheme, shown in Figure 6a, permits conflict-free access for
consecutive L/2 pairs and reduced conflict access for a bit-reversed pattern but
does not permit efficient stride 1 access. To obtain the stride 1 access it
is sufficient to skew each row by 1 module from the preceding row.
The resulting patterns are shown in Figure 6b.
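
The construction above can be sketched in Python as follows. The block number of an element is its index divided by L/N, and the skew adds the row index modulo N; the function names are illustrative and the sketch assumes that L is a multiple of N.

```python
def proposed_location(k, L, N):
    """Return (row, module) of element k under the skewed block scheme."""
    rows = L // N                      # elements stored in each module
    row = k % rows                     # position within the module
    module = (k // rows + row) % N     # block number, skewed by the row
    return row, module


def layout(L, N):
    """Build the table of Figure 6b: layout(L, N)[row][module] = element."""
    rows = L // N
    table = [[None] * N for _ in range(rows)]
    for k in range(L):
        row, module = proposed_location(k, L, N)
        table[row][module] = k
    return table
```

For L = 16 and N = 4, `layout(16, 4)` reproduces the four rows of Figure 6b.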


Parallel Network:  conflict occurs if a single module is referenced
                   more than once in each group of 4 references.

                   0321 3012 2103 1230  : no conflicts

Serial Network:    conflict occurs if successive references to a
                   module are separated by fewer than 3 references.

                   0321 3012 2103 1230  : several conflicts


Figure 5: Sequence of Module References To Demonstrate the Effects
of the Interconnection Network - N = 4


Performance Comparison


To evaluate the relative performance of the two schemes several
simulation experiments were performed. Using a discrete-event
simulation package a model of the architecture was developed. In the
performance measurements it was assumed that the bus cycle time, $t_b$,
was matched to the memory cycle time, $t_m$: $t_m = N \cdot t_b$. Also, timing
was normalized to the bus cycle time ($t_b = 1$). The input to the
simulations was the sequence of module addresses required to fetch the
particular vector elements. The simulation measures memory system
throughput, TP, as a function of the sequence of module addresses.
Figure 7 compares the performance of the scheme proposed by Norton
and Melton with the scheme proposed here. The solid lines indicate
performance for stride 1 accesses. The dotted line measures
throughput for a bit-reversed access under the proposed scheme.
Bit-reversed accesses under the RP3 scheme (the Norton/Melton scheme)
perform identically to stride 1 accesses. The graphs indicated by the
dashed lines represent performance on stride L/2 accesses. It is clear
that, for the three access patterns and the architecture considered here,
the proposed scheme leads to better memory performance. It should
also be noted that the proposed scheme requires $L \ge N^2$ to distribute
vector elements correctly. This is not viewed as a strong constraint
since in architectures similar to the one discussed here N is typically
on the order of 4 to 16. Larger values of N are not required since the
bus quickly becomes the bottleneck of the system. These values of N do not
require a particularly large value of L.
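
A throughput measurement of this kind can be approximated with a very small Python model. This is not the authors' simulator; it is a sketch that assumes one reference is issued per bus cycle, that a module stays busy for N bus cycles after accepting a reference, and that a reference to a busy module stalls the bus until the module is free. The function name `throughput` is hypothetical.

```python
def throughput(modules, N):
    """References completed per bus cycle for a non-empty module sequence."""
    free_at = [0] * N      # bus cycle at which each module becomes free
    now = 0                # current bus cycle
    for m in modules:
        start = max(now, free_at[m])   # stall if the module is still busy
        free_at[m] = start + N         # memory cycle lasts N bus cycles
        now = start + 1                # the bus is released after one cycle
    return len(modules) / now
```

A perfectly interleaved sequence keeps the bus saturated, while repeated references to one module reduce the rate towards one reference per memory cycle.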


    M0   M1   M2   M3
     0    4    8   12
     1    5    9   13
     2    6   10   14
     3    7   11   15

Figure 6a

    M0   M1   M2   M3
     0    4    8   12
    13    1    5    9
    10   14    2    6
     7   11   15    3

Figure 6b


Figure 6: Proposed Storage Scheme


Figure 7: Performance Comparison


Other simulations have been performed by varying the value of 
N used. Results of these simulations are similar to those presented in 
Figure 7. 


Address Generation 


For any storage scheme to be practical the question of address
generation must also be considered. Norton and Melton devised an
elegant method for implementing their address mapping. In this
section a simple method is demonstrated which generates the address of
the $k$-th element of a vector access (either stride 1 or bit-reversed). The
hardware required to perform the mapping of $k$ into the address of the
appropriate vector element is shown to be inexpensive. In the
following discussion the length of the vector is given by $L = 2^l$ and the
number of memory modules is given by $N = 2^n$.


Equations (1.r) and (1.m) specify the row and module address of
the $k$-th element referenced during a stride 1 access. The module
address is the number of the module being referenced; the row address
is the location of the reference within the module.


$$r_1(k) = k \bmod \frac{L}{N} \qquad (1.r)$$

$$m_1(k) = \left( \left\lfloor \frac{k}{L/N} \right\rfloor + k \right) \bmod N \qquad (1.m)$$


Let bits of a value be numbered from 0 at the least significant position
and let $x_{i:j}$ represent bits $i$ through $j$ of $x$. By using the fact that $L$
and $N$ are both powers of 2 equation (1.m) can be rewritten as:


$$m_1(k) = \left[ k_{l-n:l-1} + k_{0:n-1} \right] \bmod N$$


Equations (2.r) and (2.m) specify the addresses for the bit-reversed
access pattern. $BR(x)$ indicates the value of $x$ after reversing its bits.


$$r_{br}(k) = BR(k_{n:l-1}) \qquad (2.r)$$

$$m_{br}(k) = \left( BR(k_{0:n-1}) + BR(k_{l-n:l-1}) \right) \bmod N \qquad (2.m)$$


Circuits which compute the values of $m_1(k)$, $r_1(k)$, $m_{br}(k)$,
and $r_{br}(k)$ are shown in Figure 8. Blocks labeled reverse perform
bit-reversal on their inputs. This is achieved simply by mapping a
permutation of the input bits to the output bits. The additional hardware
required to implement the storage scheme consists of two n bit adders.

[Figure 8 shows a counter supplying $k_{0:l-1}$; bit fields of $k$ feed reverse blocks and adders that produce $r_1(k)$, $m_1(k)$, $r_{br}(k)$ and $m_{br}(k)$.]

Figure 8: Address Generation Hardware


Conclusions


To summarize, advances in technology have permitted the fabri- 
cation of extremely fast arithmetic units. This has served to move the 
bottleneck of computationally intensive problems, such as the FFT, 
from the ALU to the memory. One solution to this problem is to 
provide parallel memories so that overlap can occur between
successive references if the references are directed to different modules.


For FFT algorithms designed for an architecture with a parallel 
memory system the constant geometry version of the radix 2 FFT will 
better utilize memory bandwidth. The proposed storage scheme allows 
for fast memory access in the patterns required for the constant 
geometry radix 2 FFT. This may be of particular importance in real- 
time applications. 


The analysis was based on an architecture with a low-cost, serial 
interconnection network. Under these constraints it was shown that the 
proposed storage scheme provided better memory performance than 
the scheme proposed by Norton and Melton. The problem of generat- 
ing addresses was also considered and it was shown that a simple cir- 
cuit based on two adders is capable of generating the proper address 
sequence. 
Abstract -- Distributed Instruction Set Computer
(DISC) is a multiple functional-unit computer system. It
employs a new architecture, Distributed Instruction Set
Architecture (DISA), to exploit execution parallelism at
the instruction level. DISA extends the data flow concept to
combine the operation and execution information. The
execution information, such as data dependence, is detected
by a post-compiler and attached to the opcode. DISA
instructions are self-contained execution units which can be
executed independently of one another in multiple
functional units. This alleviates the performance bottleneck in a
conventional multiple functional-unit system which uses a
large associative memory table to decode instructions. DISC and
DISA together demonstrate a new and efficient way to
incorporate the data flow concept into von Neumann computer
architecture.


Introduction 


The current thinking in speeding up instruction
execution in a multiple functional-unit system is to apply
"concurrent instruction issuing" [1], "out-of-order execution"
[2] or "branch prediction" techniques [3]. Each technique has
run into some difficulties when it is implemented in a von
Neumann type instruction set. For instance, a large
table is needed to support a high degree of instruction issuing. The data
tag search in a large table contributes to a lengthy clock
cycle. For out-of-order execution, the system needs to
repair the side effects of the execution whenever it encounters
an exception [4]. This repair work becomes an overhead which
slows down the processing speed. Our investigation into these
techniques has shown that the problem is caused not by the
techniques but by the von Neumann type instruction set
architecture. A von Neumann type instruction set architecture
is designed for sequential execution in a uni-processor
system. It separates the operation information from the
execution information. The operation information, such as
opcode and operand, is given in the source code. The machine
decodes the instruction to determine the execution information,
such as data dependencies, at run time in order to execute the
instruction. It is this run-time detection of execution information
that puts a tremendous burden on hardware design. It becomes
a performance bottleneck when we intend to execute multiple
instructions at the same time. A direct solution to this problem
is to give up the von Neumann type instruction set architecture
for a multiple functional-unit machine.


In this paper, we introduce the concept of Distributed
Instruction Set Architecture (DISA) to solve this problem. We
first describe the concept of DISA. We then develop a generic
hardware system based on DISA. The system, which we call
Distributed Instruction Set Computer (DISC), is modeled in
software to study its performance. We report our initial
evaluation of DISC by running several benchmark programs
on the model.


Distributed Instruction Set Architecture


There are three steps to issue an instruction: 1) fetch the
instruction; 2) detect the status of the execution unit; 3) decode and
issue the instruction for execution. The concept of DISA is to
speed up this procedure by eliminating the first two steps in a
multiple functional-unit environment. We use software
techniques to detect data dependence among instructions
at compile time: a post-compiler detects the
data dependencies among the instructions. It converts each data
dependence in an instruction into a data tag which shows the
number of cycles the data needs to become "mature" for this
instruction. The data tag is then attached to the instruction. By
doing this, an instruction which is fetched from the memory
can be immediately issued to a functional unit for execution. It
is the responsibility of each functional unit to check the data
tag before it proceeds with execution. This is the same concept as a
tagged-token data flow architecture [5]. However, we detect
the data dependence in software, which eliminates the
lengthy table search of the conventional data flow approach.


An example is given in Figure 1 to explain the idea. In
Figure 1, the first and second instructions have a dependence
through R3. In DISA, we assign R3 in the first
instruction a 0 tag, and R3 in the second instruction
a 1 tag. When the first two instructions are sent to two
functional units for execution in the same cycle, the second
functional unit finds R3 with a non-zero tag. It decreases the
tag by one and waits for one cycle. In the next execution
cycle, it re-checks the tag, finds a zero tag, and proceeds to
complete the execution. For an instruction whose status is
unknown at compile time, such as a memory instruction or a
conditionally dependent instruction, we assign a flag bit to the
instruction. The flag indicates that a conditional bit must
be checked before execution. When a functional unit
receives the instruction, it checks the conditional bit in addition
to the data tag. It executes the instruction only when
both are satisfied. An example is R7 in instructions 4 and b.
Both depend on the outcome of instruction 3.


1. Add     R1,R2,R3    /* R1+R2 -> R3 */
2. Or      R3,R4,R5    /* R3 OR R4 -> R5 */
3. CBra,=  R5,0,#b     /* jump to #b if R5=0 */
4. Sub     R4,R6,R7    /* R4-R6 -> R7 */

b. Sub     R1,R2,R7    /* R1-R2 -> R7 */


Figure 1: Data dependence among instructions. 
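
One way to picture how a post-compiler might derive such tags is the Python sketch below. It is a hypothetical illustration, not the authors' algorithm: it assumes that all instructions in a window are issued in the same cycle, that each executes in one cycle once its operands are mature (the c = 1 case), and that only source operands need tags. Conditional flags are omitted, and the name `assign_tags` is invented for this example.

```python
def assign_tags(window):
    """Assign a data tag to each source register of each instruction.

    `window` is a list of (sources, destination) pairs in program order.
    A tag is the number of cycles the operand needs to become mature.
    """
    ready = {}      # register -> cycle at which its new value is mature
    tagged = []
    for sources, destination in window:
        tags = {reg: ready.get(reg, 0) for reg in sources}
        start = max(tags.values(), default=0)   # cycle of execution
        if destination is not None:
            ready[destination] = start + 1      # result mature one cycle later
        tagged.append(tags)
    return tagged
```

For the first two instructions of Figure 1, written as `(("R1", "R2"), "R3")` and `(("R3", "R4"), "R5")`, the sketch gives R3 in the second instruction a tag of 1 and every other source a tag of 0, in line with the example in the text.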


The control flow of DISA is shown in Figure 2. A
DISA processing cycle consists of a transmission sub-cycle
and an execution sub-cycle, as shown in Figure 2a. The
instruction dispatcher pre-fetches instructions. It then sends n
instructions to n functional units in the forward routing phase
of the transmission sub-cycle. A free functional unit accepts a
new instruction. It checks the data tag and conditional bit in the
instruction during the checking phase of the execution sub-cycle.
If the tag shows an executable status, the functional unit proceeds
with the execution in the second phase. Otherwise, it decreases the
tag by 1 and idles for one cycle. In the next cycle, this
functional unit refuses any new instruction and repeats the
check on the same instruction until it finishes the execution.
Meanwhile, the unaccepted instructions are automatically
routed back to the instruction dispatcher during the
backward sorting phase, the second phase of the transmission
sub-cycle. They are re-tried in the next transmission.
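
The processing cycle described above can be mimicked with a toy Python model. This is a simplified sketch under several assumptions that are not in the paper: each instruction is reduced to a single tag value, conditional bits are ignored, tags are not re-adjusted when an instruction is accepted late, and rejected instructions are simply offered again in program order. The function name `simulate` is hypothetical.

```python
def simulate(tags, n_units):
    """Return the number of processing cycles needed to run all instructions.

    `tags` lists, in program order, the largest data tag of each instruction.
    """
    waiting = list(tags)
    units = [None] * n_units    # tag held by each functional unit, or None
    cycles = 0
    while waiting or any(u is not None for u in units):
        # Transmission sub-cycle: free units accept new instructions;
        # the rest stay with the dispatcher for the next transmission.
        for i in range(n_units):
            if units[i] is None and waiting:
                units[i] = waiting.pop(0)
        # Execution sub-cycle: check the tag, then execute or idle.
        for i in range(n_units):
            if units[i] is None:
                continue
            if units[i] == 0:
                units[i] = None     # executed; the unit is free again
            else:
                units[i] -= 1       # decrement the tag and retry next cycle
        cycles += 1
    return cycles
```

With two functional units, two instructions tagged 0 and 1 finish in two cycles: the first executes immediately while the second idles for one cycle and then executes.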


[Timing diagram. The transmission sub-cycle consists of Phase 1 (forward routing) and Phase 2 (backward sorting); the execution sub-cycle consists of Phase 1 (tag checking) and Phase 2 (execution).]

Figure 2a: DISA Processing Cycle


[Flow diagram, which also labels a data updating phase. Legend: ID: Instruction Dispatcher. FU: Functional Unit. INET: Instruction Network.]

Figure 2b: DISA System Control Flow Diagram

Instruction Set Format 


DISA is a register intensive instruction set architecture 
with load and store as the only memory access instructions. 
Each register is associated with a data tag which shows the 
dependencies with other instructions. The instruction format 
is affected by the structure of its targeted machine. In our 
research, we characterize a multiple functional-unit DISC 
system with four factors, [c,n,m,b]: c is the number of
execution cycles per instruction, n is the number of functional
units in the system, m is the number of memory ports,
and b is the level of branch prediction.


DISA has three instruction formats: one operand with
long immediate, two operands with short immediate, and three
operands. Each operand in the instruction is assigned a tag, as
shown in Figure 3. The tag (TAG) is an i-bit field. In a
[1,n,m,b] system, the relation is i = log2(n). A dynamic
tag (DTAG) field is defined to handle the dynamic flow
information. Each bit in the DTAG corresponds to a
conditional flag. It requires the functional unit to check the flag
in addition to all the TAG fields. A good post-compiling
detection algorithm is the key to this idea; it is described in
the following section.


Figure 3a: 3-operand tagged instruction format


Figure 3b: 2-operand tagged instruction format


Figure 3c: 1-operand tagged instruction format
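As a small illustration of the relation i = log2(n), the sketch below computes the TAG width for a few values of n. The function name `tag_bits` is hypothetical, and rounding up when n is not a power of two is an assumption; the text only states the power-of-two relation.

```python
def tag_bits(n: int) -> int:
    """Bits needed to name one of n functional units.

    Equals log2(n) when n is a power of two; rounding up otherwise
    is an assumption, not something stated in the paper.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"n={n}: TAG width i={tag_bits(n)}")
```

For the [1,8,4,1] system discussed later, this gives a 3-bit TAG per operand.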


Post-compiling Algorithm 


The post-compiler is a piece of software. It reads the
assembly source code, decodes the instructions and generates
the data-tagged DISA instructions for a specific DISC [c,n,m,b]
system. An "active window" algorithm is used in calculating
the data tag. The size of an active window, w, is defined as the
number of instructions which can be active at one instant in
the system. The post-compiler checks one window at a time to
calculate tags. In the worst case, it needs to check p-n+1
windows for a program with p instructions in a [1,n,m,b]
DISC system. A general post-compiling algorithm is shown in
Algorithm 1. A detailed description is given in a separate
report [6].


Algorithm 1: Post-compiling scheme 

1) Scan one instruction. If it is a load instruction, we
adjust its memory ports. If it is a branch-target
instruction and not the first instruction in a new
window, we adjust the instruction sequence to make it
the first instruction and jump to step (4).

2) Calculate TAG and DTAG.

3) Repeat steps 1 to 2 to scan a new instruction until we
fill the current window.

4) Calculate the instructions for the next window.

5) Repeat steps 1 to 3 to finish a new window.

6) Repeat steps 1 to 5 to finish the program.
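The worst-case count of p-n+1 windows quoted above is the number of positions of a window of n consecutive instructions that advances by one instruction at a time. The sketch below only illustrates that count; it assumes the window size w equals n, and it does not attempt the TAG/DTAG calculation or the branch-target adjustment, which are detailed in [6]. The name `active_windows` is hypothetical.

```python
def active_windows(program, w):
    """Yield every window of w consecutive instructions, advancing by one.

    This models only the worst case in which the window slides by a
    single instruction each time (an assumption for illustration).
    """
    for start in range(len(program) - w + 1):
        yield program[start:start + w]


if __name__ == "__main__":
    program = [f"i{k}" for k in range(6)]   # p = 6 placeholder instructions
    n = 4                                   # window size, assumed equal to n
    windows = list(active_windows(program, n))
    print(len(windows))                     # p - n + 1 windows
    for win in windows:
        print(win)
```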


Distributed Instruction Set Computer 


Based on the DISA developed above, we propose a DISC
system with 8 functional units. A [1,8,4,1] DISC system is
shown in Figure 4. A processing cycle starts with the
Instruction Dispatcher (ID) fetching multiple instructions from
the system instruction cache. ID coordinates instruction
issuing with the instruction network (INET) to send n
instructions to n functional units (FU) every cycle. A free FU
accepts a new instruction and checks the data tag. If the tag
shows an operand is not ready at that moment, the FU
updates the tag and holds the instruction for one cycle.
Meanwhile, INET routes rejected instructions back to ID. In
the next transmission cycle, the busy FU rejects any newly
arriving instruction and checks the tag again. It repeats the
same sequence until the operand is ready. It then fetches data
from the Register File (RF) through the data network (DNET) and
executes the instruction. If it is an ALU instruction, the FU stores
the result back into RF and becomes available to accept a new
instruction in the next transmission cycle.


Instruction Dispatcher 


ID is the interface logic between the CPU and the instruction
cache. It fetches multiple instructions from the instruction cache
and issues them to the FUs through INET.


Figure 4: DISC [1,8,4,1] (blocks shown: Instruction Cache and
Cache Controller, Instruction Dispatcher, Instruction Network,
Common Control Bus, Data Network, Read Only Register, Register
File, Data Cache and Cache Controller)


System Organization 


Instruction Network 


INET is a multistage n by n, circuit switching,
synchronous control network. The INET with n=8 is shown
in Figure 5. Each 2x2 switch element has a third control port
to the elements above and below it. Built-in logic gives the
upper output a higher priority than the lower one. A
switch element always tries to send an input instruction to the
upper output port if it is available. The element in the last row
has an internal buffer to hold blocked instructions.


The INET provides an automatic routing scheme in which
the network routes the instructions by itself to minimize the
transmission overhead. It employs a three-step routing
scheme. The first step is to detect the status of each FU and set up
the forward routing paths for each channel. A FU drives its
input port low or high to indicate that it is able or unable,
respectively, to accept a new instruction in this cycle. The
switch element next to it raises both input ports and its control
output high if it senses a "disabled" FU. Then, each switch
element detects the status of its previous elements and the
element above it to set up its switch pattern. If both its outputs
are high, it raises both input ports and its control output high.
If only its control port is high, it raises the lower input port and
its control output high. A high port indicates that the path is
blocked. An element with both its outputs high is unable to
receive or transmit any instruction. In the second step, INET
sends the instructions from ID to the FUs by following the
physical paths that were set up. In the third step, INET
routes unaccepted instructions which are blocked in INET
back to the channels of ID.
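The three routing steps can be summarized by a purely behavioral sketch. It is not a switch-level model of INET: it assumes that every free FU can be reached (no internal blocking in the multistage fabric), that instructions are matched to free FUs in channel order, and that the surplus instructions are the ones returned to ID. The function name `transmission_cycle` and the data layout are hypothetical.

```python
def transmission_cycle(instructions, fu_free):
    """Behavioral abstraction of one INET transmission cycle.

    instructions: list of instructions offered by ID, in channel order.
    fu_free:      list of booleans, True if that FU can accept this cycle.
    Returns (accepted, rejected): accepted maps FU index -> instruction,
    rejected is the list routed back to ID for the next transmission.
    """
    # Step 1: sense FU status to find the ports that are not blocked.
    free_ports = [i for i, free in enumerate(fu_free) if free]
    # Step 2: forward routing, lowest-numbered free port first (assumed).
    accepted = dict(zip(free_ports, instructions))
    # Step 3: backward routing of whatever could not be delivered.
    rejected = list(instructions[len(free_ports):])
    return accepted, rejected


if __name__ == "__main__":
    instrs = [f"i{k}" for k in range(8)]
    free = [True, False, True, True, False, True, True, True]  # FU1, FU4 busy
    accepted, rejected = transmission_cycle(instrs, free)
    print(accepted)
    print(rejected)
```

In this example six instructions are delivered and two are returned to ID, matching the retry behavior described for unaccepted instructions.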


Functional Unit 


A FU has a 32-bit ALU to provide a three-stage
pipelined execution. It fetches two data operands, executes the
instruction and stores the result within one cycle.


Figure 5: INET with n=8.


Data Network 


DNET is a 32 by 8 cross-bar network. It transmits
data between RF and the FUs. It supports one-to-one and
many-to-one transactions in every execution cycle.


Common Control Bus 


The common control bus (CCB) provides inter-unit
communication among the FUs. It is used to carry any run-time
information needed to execute the instructions, such as an
exception or interrupt. An interrupt or exception is detected
by a FU and reported to ID through CCB. When ID receives
an exception report, it holds instruction issuing and sends a
status report instruction (SRI) to each FU to request a status report.
A FU executes the SRI to report its status, either complete or
incomplete. When ID sees an "incomplete" status report, it
needs to repair the system to a re-startable point before it
issues the exception handler instructions. The overall flow
chart is shown in Figure 6.


Figure 6. DISC Exception Processing Flow Chart.

1. An exception is reported to ID; ID stops instruction issuing.
2. ID sends SRI to each FU and asks for a status report via CCB.
3. Instruction completed?
   Yes: No repair work. ID marks the ...
   No: Repair the system to a re-startable point. Each FU sends its
       incomplete instruction back to ID. ID latches all the
       "incomplete" instructions and marks them as the re-startable
       point.
4. Start the exception handling procedure.


Evaluation

Four programs are written in a generic DISC assembly
language. The first program, Matrix, calculates matrix addition
and subtraction on two linear arrays. Each array has 100
elements. The second program, Matrix-I, is the same as Matrix
but is optimized for the DISC system. The optimization
techniques are: 1) loop unfolding, 2) register renaming and 3)
instruction re-ordering. The third program, Bubble, is a
non-numerical one which uses the bubble sort algorithm to sort 10
elements of a linear array in ascending order. We sort only
10 elements so that the results can be tabulated alongside the
other programs. The fourth program, Salesman, is a traveling
salesman problem. It finds the minimal distance for a salesman
who visits 6 cities once and returns to the starting city.


Each program is traced in a DISC system with four
configurations: [1,1,1,1], [1,2,1,1], [1,4,1,1] and [1,8,1,1].
Since we insert NOOP (no-operation) instructions into the
program at compile time, the program size differs between
system configurations, as shown in Table 1a. A
[1,1,1,1] DISC system is equivalent to a conventional von
Neumann type machine. Table 1b shows a gradual
performance improvement as the number of FUs increases.
During the trace analysis, a non-memory instruction takes one
cycle and a memory instruction takes two cycles. The impact
of memory latency is neglected by assuming no cache miss or
TLB miss. It will be studied in the future.


Table 1a: DISC program size in number of instructions


Table 1b: DISC program execution time (cycles) and performance
ratio, where

Performance ratio = (execution time of n=1) / (execution time of n=2,4,8)


Figure 7a: DISC program execution time (number of execution
cycles versus number of FUs)


Figure 7b: DISC program performance ratio


Figure 7a shows the total execution cycles for each of the four
programs in a DISC system. Figure 7b shows the equivalent
performance gain. The performance gain is not linearly
proportional to the number of functional units in the system.
An 80% improvement is gained for numerical programs and
60% for non-numerical programs. The numerical programs
contain few control instructions and many computation
instructions. They tend to have a better performance
improvement than non-numerical programs. However,
non-numerical programs show better performance
improvement than numerical programs in a DISC [1,8,1,1]
system. This is because non-numerical programs tend to have
multiple branch instructions which can only be exploited with a
large number of functional units.


Conclusion 


DISA applies the concept of data flow to a multiple
functional-unit machine. It modifies the von Neumann
instruction set architecture to combine the operation
and execution information. Software techniques are
applied to coordinate the data dependencies among the
instructions, which minimizes the hardware complexity and system
overhead in a distributed execution environment. Together,
these make DISA an ideal candidate for a distributed,
parallel processing, multiple functional-unit machine. A
generic system, DISC, has been proposed to study various
aspects of DISA. The instruction streams and data streams are
shown to be the most critical components in DISC. We devised
INET, which maintains a constant instruction stream into the
multiple-FU engine. Its regular cells and simple routing
scheme make INET attractive for practical implementation.


Two software simulation models are under construction.
One is used to study the INET issuing mechanism; the other is
used to investigate the DISC system. Our future effort is to set up
a [1,8,m,1] DISC system model. Then, benchmark programs
will be written or collected to study the system performance on
the model. Once the DISA concept is proven, we will concentrate
on writing a DISC compiler and post-compiler to translate a
program from the C language into DISC assembly language. At
this early stage of research, we feel that we have found a
simple way to incorporate the data flow concept into the von
Neumann computer architecture. DISA enables us to speed up
program execution at the instruction set level in a simple
and efficient way.


Reference 


[1] R.D. Acosta, J. Kjelstrup and H.C. Torng, "An instruction
issuing approach to enhancing performance in multiple
functional unit processors", IEEE Trans. Comput., vol. C-35,
no. 9, pp. 815-828, Sep. 1986.

[2] Y.N. Patt, W.M. Hwu and M. Shebanow, "HPS, a new
microarchitecture: rationale and introduction", Proc. of
the 18th Microprogramming Workshop, pp. 103-108,
Dec. 1985.

[3] J.E. Smith, "A study of branch prediction strategies", 8th
Int. Symp. on Comput. Arch., pp. 135-148, May 1981.

[4] W.W. Hwu and Y.N. Patt, "Checkpoint repair for
high-performance out-of-order execution machines", IEEE
Trans. Comput., vol. C-36, no. 12, pp. 1496-1514, Dec. 1987.

[5] Arvind and R.S. Nikhil, "Executing a program on the MIT
tagged-token dataflow architecture", Proc. PARLE
Conf., Eindhoven, The Netherlands, Jun. 1987.

[6] L. Wang, D.C. Zu and J.M. Chai, "DISC post-compiler
structure and algorithm", University of Texas, ECE
Department, Internal Tech. Rep. No. DISC-H-88-01.


An Improved Approximation Algorithm 
for Scheduling Pipelined Machines 


David Bernstein 


IBM Research 
T. J. Watson Research Center 
Yorktown Heights, NY 10598 


Abstract 


Consider a pipelined machine which can issue one instruction
every machine cycle, but can use its result only $d + 1$ machine
cycles after it has been issued. Instruction scheduling is an
important phase of the compilation process whose goal is to
generate optimized code for such machines. Since the problem
of producing optimal instruction schedules for pipelines for
arbitrary expressions, with possibly common subexpressions, is
NP-complete (except for a few restricted cases), we concentrate
on approximation solutions. A class of scheduling algorithms,
called leveling algorithms, is defined and analyzed. The basic
leveling algorithm sometimes yields bad schedules such that the
ratio of their length over the length of an optimal schedule can
be made arbitrarily close to $2 - 1/(d + 1)$, which is the upper
bound of list schedules. We refine this algorithm to improve the
worst case ratio to $2 - 2/(d + 1)$. The time complexity of the
refined leveling algorithm is $O(n\alpha(n) + e \log n)$ where $n$ is the
number of instructions, $e$ is the number of dependences among
the instructions, and $\alpha(n)$ is a very slow-growing function.


1. Introduction 


Pipelining is a common technique for building fast processors.
In contrast to parallel processing, in which computational
jobs can be initiated simultaneously, only one instruction can
be issued every machine cycle in a pipelined machine; several
instructions may be executed concurrently, one in every stage
of the pipe. In general, recently designed computer
architectures ([HB84], [K81]) include both pipelining and
parallelism. In this paper we concentrate on the effect of
pipelining, which mostly characterizes recently proliferating
RISC machines [R83], [K84], [P85].


Pipelining may cause the insertion of NOPs (No
OPerations) into the sequence of machine instructions either
by hardware or software. In both cases a certain penalty is
paid in increased execution time. Minimizing the number of
NOPs increases the effective speed of the machine. It is the
task of a compiler that produces code for a pipelined
machine to schedule the instructions so as to eliminate as many
NOPs as possible. Previously, this problem was tackled both
in production compilers, by implementing different (heuristic)
algorithms, and in theoretical scheduling papers.


Li assumed identical delays for all the instructions in the
pipeline [L77]. In this case, assuming that the input is limited
to tree expressions, an optimal computation can be
constructed by executing first the instructions furthest from
the root of the tree. Directed acyclic graphs (dags) were
considered by Bruno et al. [BJS80]. They showed that if the
delays of all the instructions are equal to one time unit, then
Coffman-Graham's algorithm ([CG72]) can be used to
produce an optimal solution. This result was generalized in
[BG86] and [BRG87], where an optimal solution was given
for dags for the case when the delays are either 1 or 0.
However, if no bound is put on the maximal delay $d$ then the
problem of finding an optimal computation turns out to be
NP-complete, as was proved in [HG83] for dags. A recent
survey on the complexity of scheduling for pipelined
machines can be found in [LLM87]. Since a polynomial
solution to the pipeline scheduling problem for arbitrary $d$
is unlikely to be found, we turn to approximation algorithms and
study their worst case behavior.


In [AH82], [HG83], and [GM86] different heuristic
algorithms were implemented in production compilers, but
worst case bounds are not known for any of them, even though
satisfactory results were reported on average. In [BRG87] it
was proved that an upper bound for list schedules ([C76]) on
pipelined machines is $2 - 1/(d + 1)$. So, the question is how
much better than this upper bound we can do.


We propose an algorithm that follows the critical path
approach by assigning a level to each instruction. Then, a
computation is constructed in such a way that instructions at
higher levels are computed first. This has been a natural
heuristic approach to multiprocessor scheduling ([CL75],
[LS77], [S76b]). Unfortunately, there are examples in which
this basic leveling algorithm can perform on dags as badly as
the upper bound of list schedules.


Then, we define a refined leveling algorithm that improves
the worst case ratio of the length of the schedule produced by
the algorithm over the length of an optimal schedule to
$2 - 2/(d + 1)$. Also, we mention a family of examples that
approaches this worst case ratio arbitrarily closely. In
[BRG87] the same worst case ratio of $2 - 2/(d + 1)$ was
proved for Coffman-Graham's algorithm, but on average the
refined leveling algorithm is advantageous.


The time complexity of the refined leveling algorithm is
$O(n\alpha(n) + e \log n)$ where $n$ is the number of instructions, $e$ is
the number of dependences among instructions and $\alpha(n)$ is a
very slow-growing function. However, the refined leveling
algorithm (similarly to Coffman-Graham's algorithm)
requires the given dag of dependences among instructions to
be free of transitive edges. Usually, it can be assumed that
the dag is given in that form, but if it is not, the removal
of transitive edges can dominate the time complexity of the
whole process, since it takes time $O(\min(en, n^{c}))$ to do that,
with the exponent $c$ as given in [G82].


In the next section we start with some preliminary 
definitions. Then, in Section 3 the leveling algorithm is 
described, and we conclude with directions for future 
research. 


2. Background 


The scheduling model we consider consists of a single
processor $P$ and a job system $T = (J, D, G)$. $T$ comprises a set
of unit-execution-time jobs $J = \{J_1, \ldots, J_n\}$, a set of delays
$D = \{D_1, \ldots, D_n\}$ where $D_i \in \{0, \ldots, d\}$ for some fixed
integer $d$, and a directed graph $G = (J, E)$ of precedence
constraints. (The delays model the pipelined structure of $P$.)
In this paper we limit ourselves to the case where
$D_i = d$ for all $i$.


A legal schedule is defined as a one-to-one mapping $S$
from the elements of $J$ into the set $N$ of positive integers
(interpreted as time slots) such that for all $(J_i, J_j) \in E$,
$S(J_j) - S(J_i) > D_i$. A time slot of $S$ in which no job can be
executed because of delay limitations is called a NOP.
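The definition of a legal schedule translates directly into a small checker. The sketch below is illustrative only: the function names and the three-job example are hypothetical and are not taken from Figure 1. A schedule is a dictionary from job index to time slot, and NOPs are taken to be the unused slots up to the completion time.

```python
def is_legal(schedule, edges, delay):
    """Check that schedule is one-to-one into positive slots and that
    S(J_j) - S(J_i) > D_i holds for every edge (J_i, J_j)."""
    slots = list(schedule.values())
    if len(set(slots)) != len(slots) or any(s < 1 for s in slots):
        return False
    return all(schedule[j] - schedule[i] > delay[i] for i, j in edges)


def completion_time(schedule):
    """Largest time slot used by the schedule."""
    return max(schedule.values())


def nop_slots(schedule):
    """Unused time slots up to the completion time."""
    used = set(schedule.values())
    return [t for t in range(1, completion_time(schedule) + 1) if t not in used]


if __name__ == "__main__":
    # Hypothetical job system: J1 -> J3 and J2 -> J3, all delays d = 2.
    edges = [(1, 3), (2, 3)]
    delay = {1: 2, 2: 2, 3: 2}
    good = {1: 1, 2: 2, 3: 5}
    bad = {1: 1, 2: 2, 3: 4}    # J3 starts only 2 slots after J2
    print(is_legal(good, edges, delay), completion_time(good), nop_slots(good))
    print(is_legal(bad, edges, delay))
```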


We assume that $G$ has no transitive edges since they do not
impose additional restrictions on a schedule $S$ of $T$. Also, the
leveling algorithm which will be presented in Section 3
requires distinguishing between transitive and non-transitive
edges of $G$.


For example, consider the job system of Figure 1(a). The
jobs are represented by circles, and their indices appear inside
the circles. Also, it is assumed that $d = 2$. Two legal
schedules for the job system are shown in Figure 1(b), where
$i$ in column $j$ means that $J_i$ is executed in time slot $j$. Notice
that time slots 7 and 11 of $S'$ are NOPs since $D_5 = 2$ and
$D_8 = 2$.


The completion (maximum finishing) time $c(S)$ of a
schedule $S$ is defined by $\max_i S(J_i)$. For example, in Figure
1(b), $c(S') = 14$ and $c(S'') = 13$. In this work we will be
interested in minimizing the completion time, which is
equivalent to minimizing the number of NOPs. An optimal
schedule $S$ is a legal schedule for which $c(S)$ is smallest. It
turns out that $S''$ of Figure 1(b) is an optimal schedule for the
job system of Figure 1(a).


3. Scheduling algorithms 


3.1. List schedules 

Let $T = (J, D, G)$ be a job system with $n$ jobs. If $(J_i, J_j) \in E$
we say that $J_j$ is an immediate successor of $J_i$ and $J_i$ is an
immediate predecessor of $J_j$. Given a schedule $S$ for $T$, $J_i$ is
ready in time slot $k$ if each of its immediate predecessors $J_j$
has been scheduled not later than time slot $k - 1 - D_j$.


Now we consider an important class of schedules, called
list schedules ([C76]). Informally, given a priority list $L$ of the
jobs of $J$, the list schedule $S$ that corresponds to $L$ can be
constructed by the following procedure:


1. Iteratively schedule the jobs starting in time slot
1 such that during the $i$-th step, $L$ is scanned from left to
right, and the first ready job not yet scheduled is chosen
to be executed in time slot $i$.

2. If no such job is found, a NOP is inserted into $S$ in time
slot $i$.
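The two-step procedure above can be sketched as follows. The code assumes all delays equal $d$, as in the rest of the paper. The 12-job dag used in the example is a hypothetical reconstruction: its edges were chosen to be consistent with the levels, refined levels and completion times quoted in the text for Figure 1, but it is not claimed to be the graph in the figure. The function name `list_schedule` is also an assumption.

```python
def list_schedule(priority, preds, d):
    """Build the list schedule for a priority list.

    priority: jobs in priority order; preds: job -> immediate predecessors;
    d: the common delay. Returns a dict job -> time slot.
    """
    slot_of = {}
    t = 0
    while len(slot_of) < len(priority):
        t += 1
        for job in priority:
            if job in slot_of:
                continue
            # Ready in slot t: every predecessor scheduled by slot t - 1 - d.
            if all(p in slot_of and slot_of[p] <= t - 1 - d
                   for p in preds.get(job, ())):
                slot_of[job] = t
                break
        # If the scan found no ready job, slot t stays empty: a NOP.
    return slot_of


# Hypothetical dag (job -> immediate successors), d = 2.
SUCCS = {1: [6], 2: [6], 3: [6], 4: [7, 8], 5: [7, 8],
         6: [9, 10, 11], 7: [10, 11, 12], 8: [10, 11, 12]}
PREDS = {}
for u, vs in SUCCS.items():
    for v in vs:
        PREDS.setdefault(v, []).append(u)

if __name__ == "__main__":
    s1 = list_schedule(list(range(1, 13)), PREDS, 2)
    s2 = list_schedule([4, 5, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12], PREDS, 2)
    print(max(s1.values()), max(s2.values()))
```

On this dag the first priority list finishes in 14 slots, with empty slots 7 and 11, and the second finishes in 13 slots, which agrees with the completion times quoted for $S'$ and $S''$.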


Consider a class of optimal schedules for T. Since all the jobs 
in T have unit execution times, there is no reason in optimal 
schedules to leave the processor P idle whenever a ready job 
exists. Therefore, for our problem, an optimal schedule can 
always be found among list schedules. The obvious question 
is how to obtain the right priority list L. 


Analyzing a class of list schedules, we would like to know
how far from the optimum an arbitrary list schedule can be.
Let us denote optimal schedules by $S_{opt}$ and arbitrary list
schedules by $S_{list}$. The upper bound for list schedules was
proved in [BRG87] to be as follows: $R = c(S_{list})/c(S_{opt})
\le 2 - 1/(d + 1)$. In the next section, an algorithm that
improves on this upper bound is presented.


3.2. Leveled schedules 

In this section, a subclass of list schedules, called leveled
schedules, is considered. If $J_i \in J$ has no immediate successors,
we say that $J_i$ is a sink of $G$. Also, let $IS(J_i)$ (or $IS_i$ for short)
be the set of immediate successors of $J_i$.


First, the original leveling algorithm that was first
introduced in [BRG87] is described. The level $l(J_i)$ of a job $J_i$ is
defined as follows:

$$l(J_i) = \begin{cases} 0 & \text{if } J_i \text{ is a sink of } G \\ D_i + \max_{X \in IS_i} l(X) & \text{otherwise} \end{cases}$$

Notice that the total execution time of the jobs does not 
affect the levels as defined above. 


Let L be a priority list of the jobs in J constructed in a 
non-increasing order of their levels (the order among the jobs 
of the same level is arbitrary). A schedule S corresponding to 
such an L is called a leveled schedule. Intuitively, in leveled 
schedules we first schedule jobs whose delays are maximal, 
hoping that the NOPs induced by these jobs will be replaced 
by other jobs. 
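A minimal sketch of the level computation and the resulting leveled priority list is given below. It assumes all delays equal $d$; the 12-job dag is a hypothetical reconstruction chosen to agree with the level values quoted for Figure 1(a), not the figure itself, and the function name `levels` is an assumption. Ties within a level are broken by job index here, which is one of the arbitrary orders the definition allows.

```python
def levels(succs, jobs, d):
    """Level of each job: 0 for a sink, otherwise d plus the largest
    level among its immediate successors."""
    memo = {}

    def level(j):
        if j not in memo:
            s = succs.get(j, ())
            memo[j] = 0 if not s else d + max(level(x) for x in s)
        return memo[j]

    return {j: level(j) for j in jobs}


# Hypothetical dag (job -> immediate successors), d = 2.
SUCCS = {1: [6], 2: [6], 3: [6], 4: [7, 8], 5: [7, 8],
         6: [9, 10, 11], 7: [10, 11, 12], 8: [10, 11, 12]}
JOBS = list(range(1, 13))

if __name__ == "__main__":
    lv = levels(SUCCS, JOBS, 2)
    print(lv)
    # Non-increasing level; sorted() is stable, so ties keep index order.
    print(sorted(JOBS, key=lambda j: -lv[j]))
```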


However, it turns out that the leveling algorithm defined
above is not successful enough. For example, in Figure 1(a),
for $9 \le i \le 12$, $l(J_i) = 0$, for $6 \le i \le 8$, $l(J_i) = 2$ and for
$1 \le i \le 5$, $l(J_i) = 4$. This may lead to a priority list $1, 2, \ldots, 12$
that results in the non-optimal schedule $S'$ of Figure 1(b). In
general, it was shown in [BRG87] that in the worst case the
leveling algorithm described above does not improve on the
upper bound of list schedules.


In the sequel, a refined leveling algorithm (or RL for short)
that improves on the upper bound of list schedules is defined.
Let the refined level of $J_i$ be denoted by $rl(J_i)$ and let
$M(J_i) = \{rl(J_{i_1}), \ldots, rl(J_{i_{|IS_i|}})\}$ be a sequence of non-negative
integers constructed from the refined levels of the immediate
successors of $J_i$, ordered in a way that $rl(J_{i_1}) \ge \ldots \ge rl(J_{i_{|IS_i|}})$.
Then, $rl(J_i)$ is defined recursively as follows:


1. If $J_i$ is a sink of $G$ then $rl(J_i) = 0$.

2. Otherwise, $rl(J_i) = D_i + \max(rl(J_{i_1}), rl(J_{i_2}) + 1, \ldots,
rl(J_{i_{|IS_i|}}) + |IS_i| - 1)$.

Apparently, the number of the immediate successors of a job
and their refined levels are taken into consideration while
computing the refined level of a job. A priority list $L$
produced by RL is computed as follows:


1. Compute the levels $l(J_i)$ for all $i$.

2. Compute the refined levels $rl(J_i)$ for all $i$.

3. Create a priority list $L$ by first ordering the jobs $J_i$ in a
non-increasing order of $l$, and then ordering the jobs with
the same value of $l$ in a non-increasing order of $rl$. The
order among the jobs with the same values of $l$ and $rl$ is
arbitrary.


Using the refined levels of the jobs to create a priority list as
described above leaves fewer arbitrary decisions
among the jobs of the same level than
the original leveling criterion.
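The refined-level recursion and the three-step construction of the RL priority list can be sketched as below. All delays are assumed equal to $d$. The 12-job dag is a hypothetical reconstruction, chosen so that the successor multisets match the values quoted for Figure 1(a); it is not claimed to be the figure's graph. Function names are assumptions, and remaining ties are broken by job index.

```python
def levels(succs, jobs, d):
    """Basic level: 0 for a sink, else d + max level of the successors."""
    memo = {}

    def level(j):
        if j not in memo:
            s = succs.get(j, ())
            memo[j] = 0 if not s else d + max(level(x) for x in s)
        return memo[j]

    return {j: level(j) for j in jobs}


def refined_levels(succs, jobs, d):
    """Refined level: 0 for a sink, else d + max over the successors'
    refined levels, sorted non-increasingly, of (value + position)."""
    memo = {}

    def rl(j):
        if j not in memo:
            m = sorted((rl(x) for x in succs.get(j, ())), reverse=True)
            memo[j] = 0 if not m else d + max(v + k for k, v in enumerate(m))
        return memo[j]

    return {j: rl(j) for j in jobs}


def rl_priority_list(succs, jobs, d):
    """Order by non-increasing level, then non-increasing refined level."""
    lv, rv = levels(succs, jobs, d), refined_levels(succs, jobs, d)
    return sorted(jobs, key=lambda j: (-lv[j], -rv[j]))


# Hypothetical dag (job -> immediate successors), d = 2.
SUCCS = {1: [6], 2: [6], 3: [6], 4: [7, 8], 5: [7, 8],
         6: [9, 10, 11], 7: [10, 11, 12], 8: [10, 11, 12]}
JOBS = list(range(1, 13))

if __name__ == "__main__":
    print(refined_levels(SUCCS, JOBS, 2))
    print(rl_priority_list(SUCCS, JOBS, 2))
```

On this dag the refined levels are 0 for the sinks, 4 for jobs 6 to 8, 6 for jobs 1 to 3 and 7 for jobs 4 and 5, so jobs 4 and 5 move to the front of the priority list.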


For example, consider the job system of Figure 1(a), and
let us demonstrate how the refined levels are computed.
First, for $9 \le i \le 12$, $rl(J_i) = 0$. Then, $M(J_6) = M(J_7) =
M(J_8) = \{0, 0, 0\}$. Therefore, $rl(J_6) = rl(J_7) = rl(J_8) = 4$. Then
we get $M(J_1) = M(J_2) = M(J_3) = \{4\}$. Therefore, $rl(J_1) =
rl(J_2) = rl(J_3) = 6$. On the other hand, $M(J_4) =
M(J_5) = \{4, 4\}$. Therefore, $rl(J_4) = rl(J_5) = 7$. This leads to a
priority list $4, 5, 1, 2, 3, 6, \ldots, 12$ that results in the optimal
schedule $S''$ of Figure 1(b).


Because of lack of space we are not able to present the
analysis of RL and only mention the two main results:


1. The worst case ratio of the length of the schedule $S_{RL}$
produced by RL over the length of an optimal schedule
$S_{opt}$ is as follows: $R = c(S_{RL})/c(S_{opt}) \le 2 - 2/(d + 1)$.

2. The time complexity of RL is $O(\alpha(n) n + e \log n)$ where $n$
is the number of instructions, $e$ is the number of
dependences among the instructions, and $\alpha(n)$ is a
functional inverse of Ackermann's function.
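To make the improvement concrete, the sketch below tabulates the two worst case ratios, $2 - 1/(d+1)$ for arbitrary list schedules and $2 - 2/(d+1)$ for RL, for a few delay values. The function names are hypothetical; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def list_bound(d):
    """Worst case ratio for arbitrary list schedules: 2 - 1/(d + 1)."""
    return 2 - Fraction(1, d + 1)


def rl_bound(d):
    """Worst case ratio for the refined leveling algorithm: 2 - 2/(d + 1)."""
    return 2 - Fraction(2, d + 1)


if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print(f"d={d}: list {list_bound(d)}, RL {rl_bound(d)}")
```

For $d = 1$ the RL bound evaluates to 1, and both bounds approach 2 as $d$ grows.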


4. Conclusions 


In this paper we presented a refined leveling algorithm to
schedule instructions under pipelined constraints whose worst
case ratio is $2 - 2/(d + 1)$. This result is proved for the case
when all the delays are exactly $d$ machine cycles. One of the
relaxations of our pipelined model is to allow the delays to be
any integer between 0 and $d$. The question is how to extend
RL to this case in order to achieve the worst case ratio of
$2 - 2/(d + 1)$.


Another direction for further research is to search for a
scheduling algorithm which improves on RL. RL
asymptotically achieves its worst case bound of
$2 - 2/(d + 1)$ on a complex job system presented by Lam
and Sethi in [LS77], Fig. 10. One of the alternatives to
improve RL is to construct a priority list $L$ of jobs in the
non-increasing order of their refined levels (without taking
into consideration the basic levels $l$ at all). This extended
leveling algorithm can be shown to do better than
$2 - 2/(d + 1)$ on all known families of worst case examples,
including that of [LS77]. Our conjecture is that the worst
case bound of this algorithm no longer approaches 2 as $d$
increases; however, we are not able to prove this claim at the
moment. We conclude by demonstrating in Figure 2 a job
system that, to the best of our knowledge, is worst for the
extended leveling algorithm we propose.


The job system $T = (J, D, G)$ of Figure 2 consists of $k + 1$
groups of jobs. The group $t$, $0 \le t \le k$, consists of $d$ type-A
jobs $(A_{t,1}, \ldots, A_{t,d})$ and $d$ type-B jobs $(B_{t,1}, \ldots, B_{t,d})$. The
precedence constraints described in Figure 2 are such that for
all $t$, $1 \le t \le k$, every type-A (type-B) job of group $t - 1$ is
an immediate successor of every type-A (type-B) job of
group $t$. Executing the jobs of group $t$ in order $A_{t,1}, \ldots, A_{t,d},
B_{t,1}, \ldots, B_{t,d}$ results in an optimal schedule with no NOPs.
Thus, $c(S_{opt}) = n = 2d(k + 1)$.

It turns out that by applying the extended leveling
algorithm to $T$, we get that all the jobs of the same group
have the same refined level. Thus, we might get a priority list
$L$ in which the jobs of group $t$, $0 \le t \le k$, appear in order
Ais +++ Ag) Buy... Bid By applying a list scheduling 
process to L, we get a schedule S that has d — 1 NOPs after 
each A,, job except of Ay, Thus, c(S) =n + k(d — 1) and 


$R = c(S)/c(S_{opt}) = 1 + k(d - 1)/(2d(k + 1))$. By increasing
$k$, $R$ can be made arbitrarily close to $3/2 - 1/(2d)$.
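The limit follows in one step from the two completion times given above, $c(S) = n + k(d-1)$ and $c(S_{opt}) = n = 2d(k+1)$. The short derivation below only spells out that arithmetic:

```latex
\begin{align*}
R &= \frac{c(S)}{c(S_{opt})}
   = \frac{2d(k+1) + k(d-1)}{2d(k+1)}
   = 1 + \frac{k}{k+1}\cdot\frac{d-1}{2d}, \\
\lim_{k \to \infty} R &= 1 + \frac{d-1}{2d}
   = \frac{3}{2} - \frac{1}{2d}.
\end{align*}
```

For example, with $d = 2$ the limit is $5/4$, below the RL bound of $2 - 2/(d+1) = 4/3$ for the same $d$, and as $d$ grows the limit approaches $3/2$.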
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Abstract: 


We address the problem of processor partitioning in
partitionable systems used for special-purpose applications. We
demonstrate that the partition size for a task should depend
on the task characteristics, the workload, and the
availability of resources. Thus, to maximize throughput, the
partition sizes for a set of tasks should be determined at
run time. Such an approach could be supported in
special-purpose applications since the set of tasks the system
needs to support is usually known in advance. We first show
that given a set of tasks and their characteristics, the
problem of determining the optimal partition sizes for the set
of tasks is NP-Complete. We then present a polynomial
time approximation algorithm for this problem. We also
derive a worst-case bound on the solution obtained by the
algorithm as compared to the optimal solution.


Section 1: Introduction 


Partitionable architectures, also called Multiple SIMD/MIMD
architectures or MSIMD/MIMD architectures, consist
of a set of processors and controllers [NUT77, PRE80,
SIE81]. Such architectures can be partitioned into independent
subsystems, each comprising a variable number of
processors and a controller assigned to the execution of a
task. Each of these subsystems may be in either the SIMD
or the MIMD mode of computation. In addition to the
flexibility of supporting both SIMD and MIMD modes, the ability
to form multiple independent subsystems to execute several
tasks in parallel provides such a system with the potential
of achieving better utilization of processing resources.


An important problem that needs to be addressed in the
partitioning of these systems is that of determining the number
of processors allocated to each subsystem, that is, the
partition size for each task. One possible approach to
determining this size is to first derive the maximum degree of
parallelism available in the program, and then choose the
partition size to be either this maximum degree of
parallelism or the maximum number of processors in the system
that may be allocated to the task, whichever is smaller.
Such an approach to determining the number of processors
allocated to a task has been used widely in conventional
SIMD and MIMD systems, and much work in the areas of
programming languages and compiler design has been done
to support this approach [KUC77].


Such an approach, however, may not be optimal from the
standpoint of either minimizing the execution time for a
task or the completion time for a set of tasks (we define
completion time to be the least time by which all tasks
in the set have completed execution). Since most parallel
programs require communication among processors during
their execution, for many parallel programs the
communication between processors may play a dominant role in the
overall execution time with increasing partition size. As
a result, the improvement in execution time may level off
as the partition size increases. In other words, there is an
effect of diminishing return in performance with larger
partition sizes. Furthermore, beyond a certain partition size,
the execution time may actually increase. The optimal
partition size for a task depends on the computation and
communication structure of the program, the size and values
of the input data, and the computation and communication
support of the system [LIN81, NIC87, MA87, MA88].
This size could be smaller than both the maximum degree
of parallelism in the program and the maximum number of
processors that may be allocated to the task.


In a partitionable system, due to the effect of diminishing
return in performance with larger partition sizes, when
there are multiple tasks ready for execution, using
a smaller partition size for each task and executing as
many tasks in parallel as possible could lead to a shorter
completion time than using the partition size that gives the
minimum execution time for each individual task. For
example, in a simulation study of a histogramming algorithm
[KUE84], Kuehn and Siegel have shown that given a set of
four histogramming tasks of the same size ready for
execution and a partitionable system of 256 processors, using
a subsystem of 64 processors for each task and executing
the four tasks in parallel gives a shorter completion time
than using a partition of 256 processors for each task and
executing the four tasks sequentially.


Given a set of tasks which are ready for execution, the op- 
timal partition sizes for these tasks depend on the number 
of tasks in the set, their characteristics, and the amount 
of available resource. Since the information on what tasks 
are ready for execution, which we refer to as workload, and 
what resources are available for allocation cannot be de- 
termined until run time, the optimal partition sizes can be 
determined only at run time, and not at program design 
time or at compile time. Furthermore, in order to deter- 
mine at run time the optimal partition sizes for a set of 
tasks, it is necessary for the system to know the character- 
istics of each individual task in the workload. Such char- 
acteristics, however, could be difficult to obtain for a sys- 
tem designed for general purpose application because the 
tasks the system needs to support may vary widely. But 
for a system designed for a special-purpose application, the 
tasks the system needs to support are relatively fixed and 
known in advance. For example, for the application of im- 
age processing, such tasks include FFT, Histogramming, 
Convolution and Image Smoothing. It is thus possible to 
pre-analyze the characteristics of the tasks and make them
available to the system. As a result, a partitionable system
for special-purpose applications can be designed with the
ability to determine optimal partition sizes at run time.
The feasibility and advantage of this approach are naturally
determined by the overhead involved, which includes the
effort to pre-analyze the task characteristics, the storage 
required to record these characteristics in the system, and 
most importantly, the time it takes for the system to de- 
termine these partition sizes. In this paper, we focus on 
analyzing the time complexity of using such an approach 
to determining optimal partition sizes, assuming that the 
required information on task characteristics can be made 
available to the system. We show the problem of deter- 
mining such optimal partition sizes to be NP-Complete; we 
also propose a polynomial time approximation algorithm 
for this problem and derive the performance bound for the 
algorithm. 


In Section 2, we illustrate through a sequence of examples
the impact of task characteristics and workload on optimal
partition sizes. In Section 3, we formulate the processor
partitioning problem, review the multiprocessor scheduling
problems in the literature related to this problem, and
establish the NP-completeness of this problem. In Section
4, we propose a polynomial time approximation algorithm 
for the partitioning problem and derive its performance 
bound. In Section 5, we apply the approximation algo- 
rithm on some examples to illustrate the possible reduction 
in completion time by using the partition sizes determined 
by the algorithm as opposed to using the partition sizes 
that minimize the execution time for each individual task.


Section 2: Impact of Task Characteristics and Workload on Optimal Partition Size


We illustrate the impact of task characteristics and work- 
load on optimal partition sizes with some examples on the 
following model of a partitionable system. The system con- 
sists of 512 processors interconnected as a linear array. The 
links in the array are bidirectional. The system can be par- 
titioned into several subsystems, each of which consists of a 
subset of consecutive processors in the linear array, operating
in the SIMD mode. Furthermore, the time to communicate
a data item between two adjacent processors equals the
time to perform an arithmetic or logical operation over two
data items. Note that the characteristics of the architecture,
particularly those of the supporting interconnection
network, have an important impact on task characteristics,
and thus also affect the optimal partition sizes. More detailed
analyses of the impact of the characteristics of tasks,
workload, and system on optimal partition sizes are given
in [MA88].

In the following examples, let N denote the number of data 
items for a task, and K denote the partition size allocated 
for the task. To simplify our presentation, we restrict K 
to be those integers such that N is divisible by K. Let T 
denote the completion time for a set of tasks executed by 
the system, which is the least time by which all tasks in the 
set have completed execution. If the set consists of only one 
task, then T is simply the execution time of the task, which 
is the sum of the computation and communication time for 
the execution of the task. 


2.1: Impact of Task Characteristics on Optimal 
Partition Size 


Example 1. Summing N Numbers 


We use a recursive doubling algorithm to sum the $N$ numbers.
Initially all the processors are active, and each processor
is assigned $N/K$ numbers. Each processor first forms
the partial sum of its $N/K$ numbers. These partial sums are then
accumulated to form the final sum in $\log K$ iterations. In
each iteration, starting from the leftmost active processor,
every alternate processor sends its partial sum to the active
processor immediately to its right. An active processor
that receives a data item forms a new partial sum by adding the
received item to its own partial sum. At the end of an iteration,
all the sending processors are disabled. This parallel
algorithm takes $(N/K - 1) + \log K$ additions and $K - 1$
communication operations on the linear array. The completion
time $T$ is given by

$$T = \frac{N}{K} + K + \log K - 2.$$

For $1 \le K \le N$, the computation time, which is $N/K - 1 + \log K$,
is a decreasing function of $K$, and the communication
time, which is $K - 1$, is an increasing function of $K$;
combining the effects of both, the execution time is concave
upward with respect to $K$, with the minimum occurring at
some $K'$ between 1 and $N$. The variation between $T$ and
$K$ for $N = 512$ is shown in Figure 1. For $N = 512$, the
execution time has its minimum at $K = 16$, and this is the
optimal partition size for one summing task of size 512 on
the given partitionable system.
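The closed form in Example 1 is easy to tabulate. The following is a minimal sketch, assuming that N and K are powers of two (so that log K is an exact integer); the function names `summing_time` and `best_partition` are illustrative and do not come from the paper.

```python
def summing_time(n, k):
    """Execution time T = N/K + K + log2(K) - 2 from Example 1.

    Assumes k is a power of two that divides n.
    """
    log_k = k.bit_length() - 1  # exact log2 for a power of two
    return n // k + k + log_k - 2


def best_partition(n, time_fn):
    """Return the power-of-two partition size in 1..n with the least time.

    Assumes n is itself a power of two.
    """
    sizes = [2 ** e for e in range(n.bit_length())]
    return min(sizes, key=lambda k: time_fn(n, k))


if __name__ == "__main__":
    for k in [2 ** e for e in range(10)]:
        print(k, summing_time(512, k))
    print("best K:", best_partition(512, summing_time))
```

Under these assumptions the sweep reproduces the values quoted in the text for N = 512 (511 at K = 1 and 50 at K = 16), and the time starts to rise again at K = 32.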

Figure 1 Summing 512 Numbers

Example 2. Sorting N Elements 


We use the parallel algorithm given in [BAU78] to sort the
$N$ elements. The algorithm is a generalization of the odd-even
transposition sort. Each processor is assigned $N/K$ elements.
These elements are first sorted in each processor.
The resulting subsequences are then merged and redistributed
for $K$ iterations to form the final sequence. For all
odd iterations, processor $i+1$, where $i = 1, 3, \ldots, 2\lfloor K/2 \rfloor - 1$,
first sends its subsequence to processor $i$; processor $i$ then
merges the two subsequences it has, retains the first half of
the resulting subsequence (the $N/K$ smallest elements), and
sends the second half of the subsequence (the $N/K$ largest
elements) back to processor $i+1$. For all even iterations,
the same steps as the odd iterations are executed but for
$i = 2, 4, \ldots, 2\lfloor (K-1)/2 \rfloor$. After $K$ such iterations, the final
sorted sequence will be partitioned among the $K$ processors,
with each processor holding a subsequence of $N/K$ elements.
These subsequences are in increasing order from
processor 1 to processor $K$. Since the initial sorting takes
$(N/K) \log (N/K)$ comparisons and each iteration takes $2N/K$
comparisons and $2N/K$ communication operations on the linear array,
the overall completion time is given by

$$T = \frac{N}{K} \log \frac{N}{K} + 4N.$$

For $1 \le K \le N$, the computation time, which is $(N/K) \log (N/K) + 2N$,
is a decreasing function of $K$, and the communication
time, which is $2N$, is a constant with respect to $K$; as a
result, the execution time is a decreasing function of $K$,
with the minimum occurring at $K = N$. The variation
between $T$ and $K$ for $N = 512$ is shown in Figure 2. For
$N = 512$, the execution time has its minimum at $K = 512$,
and this is the optimal partition size for one sorting task of
size 512 on the given partitionable system.
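The merge-and-redistribute step of Example 2 can be simulated sequentially. The sketch below is an illustration only, not the implementation of [BAU78]: processors are modelled as Python lists, processor numbering is 0-based (so the pairs of an odd iteration start at index 0), and the name `merge_split_sort` is an assumption.

```python
import heapq
import random


def merge_split_sort(data, k):
    """Simulate the block odd-even transposition sort on k processors.

    Each simulated processor holds len(data) // k elements; k must divide
    len(data). Returns the concatenation of the blocks after k iterations.
    """
    n = len(data)
    assert n % k == 0, "k must divide the number of elements"
    size = n // k
    # Initial local sort inside every processor.
    blocks = [sorted(data[j * size:(j + 1) * size]) for j in range(k)]
    for iteration in range(1, k + 1):
        # 0-based pairing: odd iterations pair (0,1),(2,3),...;
        # even iterations pair (1,2),(3,4),...
        start = 0 if iteration % 2 == 1 else 1
        for left in range(start, k - 1, 2):
            merged = list(heapq.merge(blocks[left], blocks[left + 1]))
            # The left processor keeps the smaller half, the right the larger.
            blocks[left], blocks[left + 1] = merged[:size], merged[size:]
    return [x for block in blocks for x in block]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.randrange(1000) for _ in range(64)]
    print(merge_split_sort(sample, 8) == sorted(sample))
```

The simulation only checks the ordering behaviour of the K iterations; it does not model the comparison and communication costs that make up T.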


As illustrated in the above two examples, the optimal par- 
tition size for a task depends on the computation and com- 
munication requirements of a task, which in turn are deter- 
mined by the characteristics of the corresponding program, 
input data, and supporting architecture. For the task of 
summing, the optimal partition size is smaller than the 
maximum degree of parallelism in the task, but for sort- 
ing, these two quantities are equal. 


2.2: Impact of Workload on Optimal Partition Size 


Due to the need for communication in most parallel pro- 
grams, as we increase the number of processors allocated 
to a task by a factor of k, the execution time of the task 
is usually reduced by a factor less than k. For instance, 
for the summing task with N = 512 in Example 1, as the 
partition size increases from 1 to 16 by a factor of two each 
time (that is, from 1 to 2, 2 to 4, 4 to 8, and 8 to 16), the 


execution time decreases from 511 to 50 by factors of 1.988 
(511 to 257), 1.947 (257 to 132), 1.808 (132 to 73), and 
1.460 (73 to 50); when the partition size increases beyond 
16, the execution time increases. For the sorting task with 
N = 512 in Example 2, as the partition size increases from 
1 to 512 by a factor of two each time, the execution time de- 
creases from 6656 to 2048 by factors of 1.625, 1.391, 1.210, 
1.101, 1.046, 1.019, 1.008, 1.003, and 1.001. Due to such diminishing
return in performance with larger partition sizes,
when multiple tasks are ready for execution
in a partitionable system, using a smaller partition
size for each task and executing as many tasks in parallel 
as possible could lead to a shorter completion time than 
using the partition size that gives the minimum execution 
time for each task. We illustrate this impact of workload 
on optimal partition size in the next example.

Figure 2 Sorting 512 Elements


Example 3. Multiple Sorting Tasks of N Elements 


Suppose we have eight sorting tasks to be executed, with
each task having to sort $N$ elements. Assume that each
task is allocated a partition of size $K$. Since there are 512
processors in the system, for $K \le 64$ the eight sorting tasks
can be executed in parallel, and the completion time $T$ for
these eight sorting tasks is the same as the execution time
of one sorting task of size $N$ using a partition of size $K$.
However, for $K > 64$, the eight sorting tasks have to be
executed in multiple batches, with each batch, except possibly
the last, having $\lfloor 512/K \rfloor$ tasks. The completion time $T$
in this case is the product of $\lceil 8K/512 \rceil$ (the number of batches)
and the execution time of one sorting task on a partition of
size $K$. Figure 3 shows the variation between the completion
time $T$ and partition size $K$ for $N = 512$. The least
completion time for a set of eight tasks is obtained when
$K = 64$, and this is the optimal partition size for each such
task. The corresponding completion time $T$ for the eight
tasks is 2072. On the other hand, if we use a partition
size of 512 for each of the eight tasks, which is the optimal
partition size for the execution of one task, the completion
time $T$ is 16384 instead.


The least completion time for a set of sixteen tasks is ob- 
tained when K = 32, and this is the optimal partition size 
for each such task. The corresponding completion time T 
for the 16 tasks is 2112. If we use a partition size of 512 for 
each of the sixteen tasks, the completion time T is 32768 
instead. 
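The batch arithmetic of Example 3 can be checked with a short script. This is a sketch under the example's own assumptions (512 processors, every task given the same power-of-two partition size K, sorting time T = (N/K) log(N/K) + 4N); the function names are illustrative.

```python
def sorting_time(n, k):
    """T = (N/K) * log2(N/K) + 4N, assuming N/K is a power of two."""
    per_proc = n // k
    return per_proc * (per_proc.bit_length() - 1) + 4 * n


def completion_time(num_tasks, n, k, processors=512):
    """Completion time when every task uses a partition of size k."""
    in_parallel = processors // k             # tasks that fit side by side
    batches = -(-num_tasks // in_parallel)    # ceiling division
    return batches * sorting_time(n, k)


def best_common_partition(num_tasks, n, processors=512):
    """Power-of-two partition size that minimizes the completion time."""
    sizes = [2 ** e for e in range(processors.bit_length()) if 2 ** e <= n]
    return min(sizes, key=lambda k: completion_time(num_tasks, n, k, processors))


if __name__ == "__main__":
    for tasks in (8, 16):
        k = best_common_partition(tasks, 512)
        print(tasks, k, completion_time(tasks, 512, k),
              completion_time(tasks, 512, 512))
```

For eight and sixteen tasks the sweep selects K = 64 and K = 32 respectively, matching the completion times 2072 and 2112 reported above.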


As illustrated in the above examples, determining the par- 
tition sizes at run time based on task characteristics, work- 
load, and amount of resource available could provide higher 
throughput than determining such partition sizes at either 
the program design time or compile time. For a parti- 
tionable system designed for special purpose applications, 
since the set of tasks the system needs to support is usually 
known in advance, by pre-analyzing the characteristics of 
the tasks and making them available to the system, it is 
possible for it to determine the partition sizes at run time. 
The feasibility and advantage of such an approach are determined
by the overhead involved. In the remainder of this
paper, we focus on analyzing the time complexity of the
problem of determining optimal partition sizes at run time.


Section 3: The Processor Partitioning Problem 


In multiprocessor scheduling problems, we are given a 
number of processors and a set of tasks, and the goal is 
to schedule the tasks on these processors such that some 
objective on the execution times of the tasks is optimized. 
These are variants of the general multiprocessor schedul- 
ing problem, and most of these problems are NP-Complete 
[GAR79]. In the following, we first review the general multiprocessor
scheduling problem and two of its variants most
related to our problem. 


Let $Z^+$ denote the set of positive integers. Let $n$ denote the
number of processors and $m$ the number of tasks, where
$n, m \in Z^+$. For any positive integer $k$, we use $[k]$ to denote
the set $\{1, \ldots, k\}$.


The general multiprocessor scheduling problem can be stated 
as follows [GAR79]: Given n processors and m tasks, where 
each task requires one processor for execution with a spe- 
cific execution time and there is no precedence constraint 
among the tasks, the objective is to find a nonpreemptive 


schedule with the least completion time for all the tasks. 
This problem is NP-Complete in the strong sense for arbi- 
trary n, but can be solved in pseudo-polynomial time for 
any fixed n. The problem remains NP-Complete for n = 2. 


A variant of the general multiprocessor scheduling prob- 
lem is the multiprocessor scheduling problem with nonfragmentable
resource constraint [GAR75]. In a special case of
this problem, we are given n processors and m tasks, each 
task requiring one processor for execution with execution 
time equal to unity and no precedence constraints among 
the tasks. Further, we are given a resource R with a total 
amount $B$ available, and a nonnegative resource requirement
$R(i)$ for each task $i \in [m]$. The objective is to find a
nonpreemptive schedule with the least completion time for 
all the tasks such that the sum of the resource requirements 
of all the tasks scheduled simultaneously does not exceed 
the total amount of the resource available. An important 
characteristic of the problem is that the resource does not 
have to be allocated in contiguous blocks since the resource 
does not suffer from fragmentation. This problem is shown 
to be NP-Complete by transforming the Three-Dimensional 
Matching problem to this problem [GAR75]. 


Yet another version of the scheduling problem is the mul- 
tiprocessor scheduling problem with fragmentable resource 
constraint [BAK83]. In this, tasks share a resource such 
as memory, where such a resource may only be allocated 
in contiguous blocks. In this problem, we are given n pro- 
cessors and m tasks. We are also given a resource R with 
a total amount B available, and a nonnegative resource 
requirement $R(i)$ for each task $i \in [m]$. Once again, the
objective is to find a nonpreemptive schedule with the least
completion time for all the tasks such that the sum of the
resource requirements of all the tasks scheduled simultaneously
does not exceed the total amount of the resource
available. The distinguishing characteristic between this
problem and the earlier problem is that the resource may
only be allocated in contiguous blocks since the resource
is fragmentable. This problem is NP-Complete since it is
equivalent to the 2-D bin packing problem.


In the processor partitioning problem, we are given n pro- 
cessors, r controllers, and m tasks among which there are 
no precedence constraints. Each task can be executed by 
a number of different partition sizes. The partitions may
comprise processors which need not be contiguous in any
address space. The objective is to choose partition sizes 
for the tasks and to find a nonpreemptive schedule with 
the least completion time for all the tasks such that the 
maximum number of tasks scheduled simultaneously does 
not exceed the number of controllers r and the sum of the 
chosen partition sizes of all the tasks scheduled simulta- 
neously does not exceed the total number of processors n. 
The above problem can be shown to be NP-Complete by 
transforming the multiprocessor scheduling problem with 
fragmentable resource constraint to a restricted version of 
the above problem. Details of the proof are omitted in this 
paper. 

The processor partitioning problem that we study in the 
remainder of this paper is a special version of the prob- 
lem stated above. In this version, we are given n proces- 
sors, r controllers, and m tasks among which there are no 
precedence constraints, and each of which can be executed 
by a number of different partition sizes. The partitions
may comprise processors which need not be contiguous
in any address space. For all $i \in [m]$, let $q_i$ denote the total
number of such partition sizes for task $i$, and let the functions
$p_i : [q_i] \to [n]$ and $t_i : [q_i] \to Z^+$ define respectively
the partition sizes and the corresponding execution times
for task $i$. The functions $p_i$ and $t_i$ have properties such that
for all $k, l \in [q_i]$ where $k < l$, we have the following:


(a) $p_i(k) < p_i(l)$ (the partition sizes defined by $p_i$ are in
increasing order)


(b) $t_i(k) > t_i(l)$ (the execution time of a task decreases as
the partition size increases)

(c) $p_i(k)\,t_i(k) < p_i(l)\,t_i(l)$ (the execution time of a task using
a larger partition size decreases by a factor which is
less than the increase in partition size, since
$t_i(l)/t_i(k) > p_i(k)/p_i(l)$)

In addition, we have the assumption:

(*) $m \le r$ and $\sum_{i=1}^{m} p_i(1) \le n$ (all the tasks available for
execution can be executed in parallel when using the
smallest partition sizes for these tasks).
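Properties (a) to (c) can be checked mechanically for a given task table. The following sketch assumes a task is described by two equal-length lists of partition sizes and execution times; the function name `satisfies_properties` is illustrative. Checking adjacent entries is enough, since each of the three strict inequalities is transitive.

```python
def satisfies_properties(p, t):
    """Check properties (a), (b), (c) for one task.

    p: partition sizes, t: corresponding execution times (same length).
    """
    pairs = list(zip(p, t))
    for (p_k, t_k), (p_l, t_l) in zip(pairs, pairs[1:]):
        if not p_k < p_l:              # (a) sizes strictly increase
            return False
        if not t_k > t_l:              # (b) times strictly decrease
            return False
        if not p_k * t_k < p_l * t_l:  # (c) less-than-linear speed-up
            return False
    return True


if __name__ == "__main__":
    # Summing task of Example 1 with N = 512.
    print(satisfies_properties([1, 2, 4, 8, 16], [511, 257, 132, 73, 50]))
    # Adding K = 32 (time 51) breaks property (b).
    print(satisfies_properties([1, 2, 4, 8, 16, 32],
                               [511, 257, 132, 73, 50, 51]))
```

This is why only the partition sizes up to the one with minimum execution time are kept for a task such as summing.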


The objective is to choose partition sizes (and their execu- 
tion times, which obey the above three properties) for the 
tasks and to find a nonpreemptive schedule with the least 
completion time for all the tasks such that the maximum 
number of tasks scheduled simultaneously does not exceed 
the number of controllers r and the sum of the chosen parti- 
tion sizes of all the tasks scheduled simultaneously does not 
exceed the total number of processors $n$. This problem can
also be shown to be NP-Complete by transforming the multiprocessor
scheduling problem with fragmentable resource
constraint to a restricted version of the above problem. De- 
tails of the proof are omitted in this paper. 

In a partitionable system, properties (a) and (b) imply ordering
the partition sizes in increasing order of index and
selecting only that part of the task characteristics where
the execution time continues to decrease with increasing
partition size. For example, for the tasks of summing and
sorting discussed in Section 2.1, we only include five partition
sizes 1, 2, 4, 8, 16 for summing, while for sorting,
we include all the ten partition sizes. Property (c) implies
that speed-up in a parallel system is less than linear due
to communication and control overhead. We make assumption
(*) to simplify the presentation of our solution to the
problem. Our solution may be extended to the case where
this assumption does not hold.


Section 4: Solution Techniques for the Processor Partitioning Problem


In this section, we first derive some lower bounds on the
completion time for the processor partitioning problem. Using
these lower bounds, we derive a condition on the processor
partitioning problem under which optimal solutions
can be determined easily. We then present a polynomial
time approximation algorithm for the processor partitioning
problem. We also derive a worst-case bound on the
solution obtained by the algorithm and the conditions under
which it gives optimal solutions.


Let $S_o$ be an optimal schedule, and for each $i \in [m]$, let $u_i$
denote the index for the partition size for task $i$ in schedule
$S_o$. Let $T_o$ be the completion time for schedule $S_o$, which
is the optimal completion time.


4.1: Lower Bounds on Completion Time 

Lemma 1 provides us with a relation between the optimal
completion time $T_o$ and the partition sizes and respective
execution times for the tasks in an optimal schedule $S_o$.


Lemma 1. $T_o \ge \sum_{i=1}^{m} p_i(u_i)\,t_i(u_i)/n$.


Proof. The term $\sum_{i=1}^{m} p_i(u_i)\,t_i(u_i)$ represents the total
time units that the allocated processors are busy in schedule
$S_o$. Since $S_o$ is a feasible schedule, at any time instant, each
of the $n$ processors is allocated to at most one task. Thus
each processor is busy for at most $T_o$ time units. Since
the total number of processors allocated at any instant
is bounded by $n$, we have $\sum_{i=1}^{m} p_i(u_i)\,t_i(u_i) \le nT_o$. The
lemma follows.


As per Property (c) stated in Section 3, in a partitionable
system, as we move from a given partition size to
a larger partition size, the execution time decreases by a
factor which is less than the increase in partition size. This
property asserts that as we increase the partition size for a
task, the time to execute on the new partition cannot decrease
below a certain limit. Based on this relationship between
partition sizes and execution times, Lemma 2 derives,
under certain conditions, a lower bound on completion time
for the special case where all the tasks to be executed are
of the same type.


Lemma 2. Assume that there are $m$ tasks of the same
type to be executed on an $n$ processor system. Let $q$ denote
the number of possible partition sizes for this type of task.
In addition, let the functions $p : [q] \to [n]$ and $t : [q] \to Z^+$
denote the partition sizes and the corresponding execution
times respectively. Assume further that there exists some
$l \in [q]$ such that $p(l) \ge \lceil n/m \rceil$. Let $l^*$ be the smallest index
in $[q]$ such that $p(l^*) \ge \lceil n/m \rceil$. Then $T_o \ge t(l^*)$.


Proof. For each $i \in [m]$, let $u_i$ be the index in $[q]$ such
that $p(u_i)$ is the partition size for task $i$ in some optimal
schedule $S_o$. For each $i \in [m]$, the completion time for task $i$
is $t(u_i)$. If for some $i \in [m]$, $u_i < l^*$, then $T_o \ge t(u_i) > t(l^*)$
from Property (b) and the lemma is trivially true. Thus
assume that for every $i \in [m]$, $u_i \ge l^*$. It follows from
Property (c) that for every $i \in [m]$,

$$p(l^*)\,t(l^*) \le p(u_i)\,t(u_i).$$

The above implies that

$$m\,p(l^*)\,t(l^*) \le \sum_{i=1}^{m} p(u_i)\,t(u_i).$$

Since $p(l^*) \ge \lceil n/m \rceil$, it follows that

$$m \lceil n/m \rceil\, t(l^*) \le \sum_{i=1}^{m} p(u_i)\,t(u_i).$$

Since $\lceil n/m \rceil \ge n/m$, we have

$$t(l^*) \le \sum_{i=1}^{m} p(u_i)\,t(u_i)/n.$$

Since the optimal schedule $S_o$ has completion time $T_o$, from
Lemma 1, we have $T_o \ge t(l^*)$.


For the special case where all the tasks to be executed are
of the same type, Theorem 3 states a condition under which
a parallel schedule with $n/m$ processors in each partition has
the least completion time.


Theorem 3. Assume there are $m$ tasks of the same type
to be executed on an $n$ processor system. Let $q$ denote the
number of possible partition sizes for this type of task. In
addition, let the functions $p : [q] \to [n]$ and $t : [q] \to Z^+$
denote the partition sizes and the corresponding execution
times respectively. Assume further that $n$ is divisible by $m$
and there exists some $l^* \in [q]$ such that $p(l^*) = n/m$. Then
a parallel schedule with $n/m$ processors in each partition has
the least completion time.


Proof. Since the allocation is feasible, the theorem follows
from Lemma 2.
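For tasks of a single type, the bound of Lemma 2 amounts to a table lookup. The sketch below assumes the task table is given as two lists in increasing order of partition size; the name `same_type_lower_bound` is illustrative.

```python
def same_type_lower_bound(n, m, p, t):
    """Return (p(l*), t(l*)) for the smallest index l* with p(l*) >= ceil(n/m).

    By Lemma 2, t(l*) is a lower bound on the optimal completion time for
    m identical tasks on n processors. Returns None if no such index exists.
    """
    target = -(-n // m)  # ceil(n / m)
    for size, time in zip(p, t):
        if size >= target:
            return size, time
    return None


if __name__ == "__main__":
    sizes = [2 ** e for e in range(10)]  # 1, 2, ..., 512
    # Sorting times for N = 512: (N/K) * log2(N/K) + 4N.
    times = [(512 // k) * ((512 // k).bit_length() - 1) + 2048 for k in sizes]
    print(same_type_lower_bound(512, 8, sizes, times))
```

For the eight sorting tasks of Example 3 the lookup gives a partition size of 64 with time 2072; since 64 = 512/8 is an available size, the condition of Theorem 3 holds and the parallel schedule with 64 processors per task attains the bound.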


4.2: An Approximation Algorithm for Partitioning

We now present an approximation algorithm which runs
in $O(\min\{n, \sum_{i=1}^{m} q_i\} \log m)$ time. This algorithm explores
only parallel schedules and does not explore any serial-parallel
schedules. By assumption (*), there always exists
a feasible, parallel schedule for the given set of tasks. We
first give an informal description of the algorithm below.


Initially, each task is allocated a number of processors equal 
to the smallest partition size for this task. By assumption 
(*), such an allocation is always possible. Then, the task 
with the longest execution time is selected. As many processors
are allocated to this task as are necessary to move
to the next larger partition size. This process is repeated
until we run out of free processors. 


Intuitively, the algorithm allocates processors to tasks in 
an efficient manner. To account for the effect of diminish- 
ing return with larger partition size, the algorithm starts 
with the smallest partition size for each task, and increases 
a partition size only if the execution time corresponding to 
such a partition size determines the completion time. Since 
the criterion to be minimized is completion time, the algo- 
rithm isolates the task with the longest execution time at 
every iteration, since this is what determines the comple- 
tion time in a parallel schedule. It then allocates as many 
processors as are necessary to reduce the execution time of
this task so as to reduce the overall completion time. If 
this additional allocation results in a different task having 
the longest execution time, the algorithm allocates addi- 
tional processors to this task. Thus, it provides processors 
to tasks that need them the most, in some sense. Given 
below is a more formal statement of the algorithm. 


Approximation Algorithm: Partitioning


Input: n, m, and for every i in [m]: q_i, p_i, t_i.


Output: a set of indexes {l_i | l_i in [q_i] for i in [m]}.


begin
    remain := n;
    for i := 1 to m do
    begin
        l_i := 1;
        remain := remain - p_i(1)
    end;
    done := false;
    while (remain > 0) and (not done) do
    begin
        find j such that t_j(l_j) = max over i in {1,...,m} of t_i(l_i);
        if (l_j < q_j) and (remain >= p_j(l_j + 1) - p_j(l_j))
        then
            begin
                remain := remain - (p_j(l_j + 1) - p_j(l_j));
                l_j := l_j + 1
            end
        else
            done := true
    end
end.


In the above algorithm, the while loop will be executed no
more than $\min\{n, \sum_{i=1}^{m} q_i\}$ times. Inside the loop, we have to
find the maximum of the $t_i$'s, which can be done in time
$O(\log m)$ if we use a priority queue to store the $t_i$'s. The
rest of the algorithm can be done in $O(m)$ time. Therefore,
the approximation algorithm has complexity
$O(\min\{n, \sum_{i=1}^{m} q_i\} \log m)$. Since $\sum_{i=1}^{m} p_i(l_i) \le n$, the partition
sizes obtained from the approximation algorithm allow a parallel
schedule. In the next section, we derive some bounds
on the performance of the algorithm.
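The algorithm and its priority-queue implementation can be sketched in a few lines. This is a minimal illustration rather than the authors' code: indices are 0-based, each task is given as a list of partition sizes `p[i]` and a list of execution times `t[i]` satisfying properties (a) to (c), ties for the longest task are broken by the smallest task index, and the names `partition_sizes` and `level` are assumptions.

```python
import heapq


def partition_sizes(n, p, t):
    """Greedy partition-size selection for n processors.

    p[i][j], t[i][j]: j-th partition size and execution time of task i.
    Returns the chosen index (0-based) into p[i] for every task i.
    """
    m = len(p)
    level = [0] * m                             # start at the smallest size
    remain = n - sum(sizes[0] for sizes in p)   # free processors
    if remain < 0:
        raise ValueError("assumption (*) does not hold")
    # Max-heap on current execution time, stored as negated keys.
    heap = [(-t[i][0], i) for i in range(m)]
    heapq.heapify(heap)
    while remain > 0:
        _, j = heap[0]                          # task with the longest time
        if level[j] + 1 >= len(p[j]):
            break                               # no larger partition size
        extra = p[j][level[j] + 1] - p[j][level[j]]
        if extra > remain:
            break                               # not enough free processors
        remain -= extra
        level[j] += 1
        heapq.heapreplace(heap, (-t[j][level[j]], j))
    return level


if __name__ == "__main__":
    sizes = [2 ** e for e in range(10)]
    times = [(512 // k) * ((512 // k).bit_length() - 1) + 2048 for k in sizes]
    chosen = partition_sizes(512, [sizes] * 8, [times] * 8)
    print([sizes[j] for j in chosen])
```

Each loop iteration costs one heap replacement, which is where the log m factor in the stated complexity comes from. Run on the eight sorting tasks of Example 3, the sketch ends with every task at partition size 64.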


4.3: Performance Analysis of the Approximation 
Algorithm 


Let $S_a$ be the schedule determined by the approximation
algorithm. For each $i \in [m]$, let $l_i$ denote the index for
the partition size for task $i$ in schedule $S_a$. Let $T_a$ be the
completion time for schedule $S_a$. By definition, we have the
following set of inequalities: $T_o \le T_a$, and for all $i \in [m]$,
$t_i(u_i) \le T_o$, $t_i(l_i) \le T_a$.

Lemma 4 derives a relationship between partition sizes in
the schedule $S_a$ and partition sizes in a feasible schedule
with completion time no greater than the completion time
of $S_a$.


Lemma 4. Let $S_f$ be a feasible schedule. For all $i \in [m]$,
let $r_i$ denote the index for the partition size for task $i$ in
$S_f$. Let $T_f$ denote the completion time due to schedule $S_f$.
Assume that $T_f \le T_a$. Then, for all $i \in [m]$, $r_i \ge l_i$.


Proof. Let $i$ be an arbitrary element in $[m]$. We consider
two cases.

a) $l_i = 1$. Since in any feasible schedule, every task needs
at least as many processors as in the least-sized partition,
we have $r_i \ge l_i$.

b) $l_i > 1$. In this case, we have $t_i(l_i - 1) > T_a$, otherwise
the algorithm will not augment $l_i - 1$ to $l_i$. We thus
have the following relationship:

$$t_i(r_i) \le T_f < t_i(l_i - 1).$$

This implies that $r_i > l_i - 1$ from Property (b). Thus,
$r_i \ge l_i$.


Lemma 5 derives a lower bound on the optimal completion
time $T_o$ in terms of the partition sizes and the execution
times of the tasks in the schedule $S_a$.


Lemma 5. $T_o \ge \sum_{i=1}^{m} p_i(l_i)\,t_i(l_i)/n$.


Proof. Since $S_o$ is a feasible schedule and $T_o \le T_a$, from
Lemma 4, we have, for all $i \in [m]$, $u_i \ge l_i$. Hence from
Property (c), we have, for all $i \in [m]$,

$$p_i(u_i)\,t_i(u_i) \ge p_i(l_i)\,t_i(l_i).$$

Therefore,

$$\sum_{i=1}^{m} p_i(u_i)\,t_i(u_i) \ge \sum_{i=1}^{m} p_i(l_i)\,t_i(l_i).$$

From Lemma 1, we also have

$$T_o \ge \sum_{i=1}^{m} p_i(u_i)\,t_i(u_i)/n.$$

Thus, $T_o \ge \sum_{i=1}^{m} p_i(l_i)\,t_i(l_i)/n$.


Theorem 6 derives a worst-case bound on the completion
time $T_a$.

Theorem 6. Let $k$ be the index in $[m]$ such that task
$k$ is the one that determines the completion time $T_a$. Then

$$T_a \le \frac{n}{p_k(l_k)}\,T_o - \Big(\sum_{1 \le i \le m,\ i \ne k} p_i(l_i)\,t_i(l_i)\Big) \Big/ p_k(l_k).$$


Proof. From Lemma 5, we have

$$\sum_{i=1}^{m} p_i(l_i)\,t_i(l_i) \le nT_o.$$

Since $t_k(l_k) = T_a$, we can rewrite the above expression as

$$\sum_{1 \le i \le m,\ i \ne k} p_i(l_i)\,t_i(l_i) + p_k(l_k)\,T_a \le nT_o.$$

Thus, we have

$$T_a \le nT_o/p_k(l_k) - \sum_{1 \le i \le m,\ i \ne k} p_i(l_i)\,t_i(l_i)/p_k(l_k),$$

and the theorem follows.
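The bound of Lemma 5, on which Theorem 6 rests, can be evaluated directly from the sizes the algorithm selects. The sketch below takes the chosen (partition size, execution time) pairs as input; the function names are illustrative, and the sample input is a hand-worked assumption: for one 512-element sorting task and one 512-number summing task on 512 processors, the algorithm of Section 4.2 stops with the sorting task at size 256 (time 2050) and the summing task at size 1 (time 511).

```python
from fractions import Fraction


def lemma5_lower_bound(n, chosen):
    """Lower bound on T_o: sum of p_i(l_i) * t_i(l_i), divided by n.

    chosen: list of (partition size, execution time) pairs selected by
    the approximation algorithm, one pair per task.
    """
    return Fraction(sum(p * t for p, t in chosen), n)


def gap_ratio(n, chosen):
    """Upper bound on T_a / T_o implied by the Lemma 5 lower bound."""
    t_a = max(t for _, t in chosen)  # completion time of the parallel schedule
    return Fraction(t_a) / lemma5_lower_bound(n, chosen)


if __name__ == "__main__":
    chosen = [(256, 2050), (1, 511)]
    print(float(lemma5_lower_bound(512, chosen)))
    print(float(gap_ratio(512, chosen)))
```

In this example the guaranteed ratio is just under 2. This is only what the bound certifies; the actual gap between the algorithm's schedule and an optimal one may be much smaller.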


Theorem 7 proves that the schedule due to the approximation
algorithm is an optimal schedule among all parallel
schedules.


Theorem 7. The completion time $T_a$ due to the approximation
algorithm is the optimal completion time among
all parallel schedules.


Proof. Assume to the contrary that there exists a parallel
schedule $S_p$ with completion time $T_p$ such that $T_p < T_a$.
For each $i \in [m]$, let $v_i$ denote the index for the partition
size for task $i$ in $S_p$. Let $k$ be the index in $[m]$ such that
task $k$ is the one that the approximation algorithm tries to
increase the partition size of before it terminates (in case
two or more tasks are tied in determining the task with
the longest execution time). We have $t_k(l_k) = T_a$. From
Lemma 4, we have, for all $i \in [m]$, $v_i \ge l_i$.
Consequently, for each $i \in [m]$, $p_i(v_i) \ge p_i(l_i)$.
Further, from our assumption that $T_p < T_a$, we get
$t_k(v_k) < t_k(l_k)$. This implies that $v_k > l_k$, and hence $v_k \ge l_k + 1$.
Therefore, $p_k(v_k) \ge p_k(l_k + 1)$. It follows that

$$\sum_{i=1}^{m} p_i(v_i) \ge \sum_{1 \le i \le m,\ i \ne k} p_i(l_i) + p_k(l_k + 1).$$

Thus,

$$\sum_{i=1}^{m} p_i(v_i) \ge \sum_{i=1}^{m} p_i(l_i) + p_k(l_k + 1) - p_k(l_k). \qquad (1)$$

When the algorithm terminates, the number of remaining
processors is strictly less than $p_k(l_k + 1) - p_k(l_k)$ (task
$k$ determines the completion time). Otherwise, the algorithm
would have reduced the completion time $T_a = t_k(l_k)$
by augmenting the partition size of task $k$ from $p_k(l_k)$ to
$p_k(l_k + 1)$. Thus, we have

$$n - \sum_{i=1}^{m} p_i(l_i) < p_k(l_k + 1) - p_k(l_k).$$

This implies that

$$\sum_{i=1}^{m} p_i(l_i) + p_k(l_k + 1) - p_k(l_k) > n. \qquad (2)$$

Combining inequalities (1) and (2), we have $\sum_{i=1}^{m} p_i(v_i) > n$,
which is a contradiction since $S_p$ is a parallel schedule
and there are at most $n$ processors. Thus, no parallel schedule
$S_p$ exists with completion time $T_p$ such that $T_p < T_a$.
The completion time due to the approximation algorithm
is therefore optimal among all parallel schedules.


Theorem 8 proves that the schedule due to the approximation
algorithm is optimal under the condition that no
processors remain to be allocated when the algorithm terminates,
and in addition, the execution times of the tasks
for the partition sizes allocated by the approximation algorithm
are equal.


Theorem 8. Assume that $\sum_{i=1}^{m} p_i(l_i) = n$ and for all
$i \in [m]$, $t_i(l_i) = T_a$. Then $T_a = T_o$.


Proof. Assume to the contrary that $T_a > T_o$. From
Lemma 5 we have

$$T_o \ge \sum_{i=1}^{m} p_i(l_i)\,t_i(l_i)/n.$$

By the assumption of the theorem, we have

$$T_o \ge \sum_{i=1}^{m} p_i(l_i)\,T_a/n.$$

Since $\sum_{i=1}^{m} p_i(l_i) = n$, the above implies that
$T_o \ge T_a$,
which is a contradiction. Thus, $T_a = T_o$.


Corollary 9. Assume there are $m$ tasks of the same type
to be executed on an $n$ processor system, and $n$ is divisible
by $m$. Assume further that there exists some $l^* \in [q]$ such
that $p(l^*) = n/m$. Then the approximation algorithm obtains
an allocation with the optimal completion time, allocating
partitions consisting of $n/m$ processors to each of the $m$ tasks.


Proof. When the approximation algorithm terminates,
it allocates a partition of size $p(l_i) = n/m$ for each $i \in [m]$.
Since every task is of the same type, $t(l_i) = T_a$ for each
$i \in [m]$. Since $\sum_{i=1}^{m} p(l_i) = n$, the corollary follows from
Theorem 8.


The analysis in this subsection shows that under certain conditions (those stated in Theorem 8 and Corollary 9) the approximation algorithm produces an optimal schedule. Further, if we constrain ourselves to strictly parallel schedules, then the schedule due to the approximation algorithm is always optimal among all such schedules. In general, the performance of the schedule due to the approximation algorithm is always within the bound given in Theorem 6.


Section 5: Applications of the Approximation Algorithm


We now give an example of using the approximation al- 
gorithm in a typical application on the model of a parti- 
tionable system described in Section 2. The application we 
choose to illustrate the approximation algorithm is in the 
area of image processing. In image processing applications 
using stereo images, there is a need to compute the His- 
togram and perform Image Smoothing for a pair of images. 
Since both these computations may be carried out in parallel,
the workload may then comprise the following tasks:
two Histogramming tasks and two Smoothing tasks (one 
for the right image, and one for the left image). 


Assume that the image is a square of $\sqrt{N} \times \sqrt{N}$ pixels, where $\sqrt{N}$ is a positive integer. Let $K$ denote the size of a partition, where $\sqrt{K}$ is a positive integer. Assume also that $N$ is divisible by $K$.


We assume that the image is divided evenly over the $K$ processors so that each processor has a square subimage of $N/K$ pixels. In computing the histogram of an image, the frequency count of each grey level is computed over the entire image. The final histogram is represented as an array of $b$ elements, each element being a count of the number of pixels in the image with that grey level. Each processor first computes the histogram of the subimage local to it in time $N/K$. These partial histograms are then accumulated to form the total histogram in $\log K$ iterations using a recursive doubling algorithm similar to the one for the summing task given in Section 2, Example 1. Computing the new partial histogram in each iteration takes $b$ time units since it amounts to a vector addition of $b$ elements. Communication in the first iteration takes $b$ time units since an array of $b$ elements is sent to an adjacent processor. In general, for all $i = 1, \ldots, \log K$, communication in the $i$-th iteration takes $b + 2^{i-1} - 1$ time units since an array of $b$ elements is sent to a processor which is $2^{i-1}$ away, and the array of $b$ elements can be sent in a pipelined fashion. The total number of computation operations is $N/K + b \log K$, and the total number of communication operations is $(b-1)\log K + K - 1$. The execution time for Histogram computation is given by

$$T = \frac{N}{K} + K - 1 + (2b-1)\log_2 K.$$


For $N = 512 \times 512$, $b = 256$, and $K$ varying from 1 to 4096 as squares of powers of two, the variation of the execution time $T$ with partition size $K$ is shown in Figure 4.


| K | 1 | 4 | 16 | 64 | 256
| T | 262144 | 66561 | 18443 | 7225 | 5367


Figure 4 Histogramming
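The histogramming cost model can be checked numerically. The following is a minimal sketch, assuming the execution-time expression $T = N/K + K - 1 + (2b-1)\log_2 K$ given above; the function name `histogram_time` is illustrative and does not come from the paper.

```python
def histogram_time(N, K, b):
    """Histogram execution time on K processors for N pixels and b grey levels.

    Computation: N/K local counting steps plus b additions per doubling step.
    Communication: step i sends b elements over distance 2**(i-1), pipelined.
    """
    log_k = K.bit_length() - 1
    if 2 ** log_k != K or N % K != 0:
        raise ValueError("K must be a power of two that divides N")
    compute = N // K + b * log_k
    communicate = sum(b + 2 ** (i - 1) - 1 for i in range(1, log_k + 1))
    return compute + communicate


if __name__ == "__main__":
    N, b = 512 * 512, 256
    for K in (1, 4, 16, 64, 256):
        print(K, histogram_time(N, K, b))
```

Summing the per-iteration communication terms rather than using the closed form makes it easy to see that the two agree: the loop total equals $(b-1)\log_2 K + K - 1$.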


In the problem of Image Smoothing, the grey value of each pixel in an image is averaged with the surrounding eight neighbouring points for a given number of iterations. In each iteration of the algorithm, each processor needs to perform $8N/K$ additions and $N/K$ divisions, and send a message to each of the eight processors operating on its surrounding subimages; four of these messages are of size $\sqrt{N/K}$ elements, and four of size one element. If the subimages are mapped row by row into the linear array, each processor has to communicate with two processors at a distance of one, two at a distance of $\sqrt{K}$, two at a distance of $\sqrt{K} + 1$, and two at a distance of $\sqrt{K} - 1$. The net communication time on the linear array is thus:

$$2\sqrt{\frac{N}{K}} + 2\sqrt{\frac{N}{K}}\,\sqrt{K} + 2(\sqrt{K} - 1) + 2(\sqrt{K} + 1).$$

The total execution time for Smoothing is given by

$$T = 9\frac{N}{K} + 2\sqrt{\frac{N}{K}} + 2\sqrt{N} + 4\sqrt{K}.$$

For $N = 512 \times 512$, the variation of the execution time $T$ with partition size $K$ is shown in Figure 5.


| K | 16 | 64
| T | 148752 | 38048


Figure 5 Image Smoothing
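The smoothing cost model can be evaluated in the same way. The sketch below assumes a per-iteration cost of $9N/K$ arithmetic operations plus the four communication terms listed in the text (two messages of $\sqrt{N/K}$ elements over distance one, two over distance $\sqrt{K}$, and four single-element messages to the diagonal neighbours); the name `smoothing_time` is illustrative.

```python
from math import isqrt


def smoothing_time(N, K):
    """Smoothing execution time on K processors for an image of N pixels."""
    side = isqrt(N // K)      # side length of each square subimage
    root_k = isqrt(K)         # processors per row of the processor grid
    compute = 9 * (N // K)    # 8 additions and 1 division per pixel
    communicate = (
        2 * side              # two neighbours at distance 1
        + 2 * side * root_k   # two neighbours at distance sqrt(K)
        + 2 * (root_k - 1)    # two diagonal neighbours at distance sqrt(K) - 1
        + 2 * (root_k + 1)    # two diagonal neighbours at distance sqrt(K) + 1
    )
    return compute + communicate


if __name__ == "__main__":
    for K in (16, 64, 1024):
        print(K, smoothing_time(512 * 512, K))
```

For $K = 1024$ the model gives 3488 time units, the value used for the Smoothing tasks in the Theorem 6 bound later in this section.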
Next we apply the approximation algorithm for the job mix of two Histogramming tasks and two Smoothing tasks, all of which are available for execution. The approximation algorithm stops after 18 iterations with a schedule whose overall completion time is 5367 time units. Figure 6 shows the partition sizes for the tasks and the completion time. The figure has one entry for every two iterations since the partition size has to increase for both Smoothing tasks or both Histogramming tasks to reduce completion time at each iteration.


Figure 6 Iterations in Approximation Algorithm

If we use the maximum partition size consisting of all 4096 processors for each task, and a strictly sequential schedule to execute the tasks, then the overall completion time is 24326 time units, implying a factor of 4.5 improvement in the overall completion time with the parallel schedule obtained by the approximation algorithm. If, on the other hand, we use the partition size with the least execution time for each task (the optimal partition size for the execution of a single task) and the best schedule possible to execute the tasks, then the overall completion time is 9111 time units, implying a factor of roughly 1.7 improvement in the overall completion time with the parallel schedule obtained by the approximation algorithm.
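Both baseline figures follow from the two cost formulas. The sketch below assumes those formulas and partition sizes 1, 4, ..., 4096. The reading of "best schedule possible" as running the two Smoothing tasks one after the other on all 4096 processors and then both Histogramming tasks side by side on 256 processors each is an assumption made here; it reproduces the quoted 9111 time units.

```python
from math import isqrt

N, B = 512 * 512, 256
SIZES = [4 ** e for e in range(7)]  # 1, 4, 16, ..., 4096


def histogram_time(K):
    return N // K + K - 1 + (2 * B - 1) * (K.bit_length() - 1)


def smoothing_time(K):
    return 9 * (N // K) + 2 * isqrt(N // K) + 2 * isqrt(N) + 4 * isqrt(K)


# Baseline 1: every task gets all 4096 processors, tasks run one at a time.
sequential = 2 * histogram_time(4096) + 2 * smoothing_time(4096)

# Baseline 2: every task gets its individually fastest partition size.
best_hist = min(SIZES, key=histogram_time)
best_smooth = min(SIZES, key=smoothing_time)
staged = 2 * smoothing_time(best_smooth) + histogram_time(best_hist)

if __name__ == "__main__":
    print(sequential, best_hist, best_smooth, staged)
    print(round(sequential / 5367, 2), round(staged / 5367, 2))
```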


Applying Theorem 6 to obtain a worst-case performance bound for the approximation algorithm in the above example, we get

$$T_a < \frac{4096}{256}\,T_o - \frac{2 \times 1024 \times 3488 + 256 \times 5367}{256},$$

which implies that

$$T_a < 16\,T_o - 33271,$$

from which we can infer that $T_o > 2415$. Using this bound, we can deduce that the approximation algorithm obtains a completion time which is less than 2.23 times the optimal completion time.


For this particular example, the schedule determined by the approximation algorithm is actually optimal among all possible schedules, that is, $T_a = T_o = 5367$.
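The arithmetic behind the worst-case bound can be verified exactly. This sketch assumes the instantiated inequality $T_a < 16\,T_o - 33271$ as it appears in the example; Theorem 6 itself is not restated here, and the variable names are illustrative.

```python
from fractions import Fraction

n, p_k, T_a = 4096, 256, 5367

# Work of the other three tasks: two Smoothing tasks on 1024 processors each
# and the second Histogramming task on 256 processors.
other_work = 2 * 1024 * 3488 + 256 * 5367

factor = Fraction(n, p_k)           # coefficient of T_o
offset = Fraction(other_work, p_k)  # constant subtracted on the right-hand side

# T_a < factor * T_o - offset  implies  T_o > (T_a + offset) / factor
lower_bound = (T_a + offset) / factor
ratio = T_a / lower_bound

if __name__ == "__main__":
    print(factor, offset, float(lower_bound), float(ratio))
```

The lower bound evaluates to 2414.875, which the text reports as 2415, and the resulting ratio stays below 2.23.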


Section 6: Conclusion 


In this paper, we address the problem of processor par- 
titioning in partitionable architectures. We demonstrate 
the importance of determining the partition sizes based on 
task characteristics, workload, and availability of resources. 
An underlying assumption is that the task characteristics 
are available. For a system designed for a special-purpose 
application, it is possible to pre-analyze the characteris- 
tics of the tasks and make them available to the system 
since the set of tasks the system needs to support is usually 
known in advance. For such systems, we advocate deter- 
mining the partition sizes at run time. To support such an 
approach, we investigate the design of an efficient approxi- 
mation algorithm to determine the partition sizes based on 
task characteristics and workload. We derive the worst-case 
performance bound for the approximation algorithm, and 
conditions under which the algorithm is optimal. 


Other important issues that may affect the feasibility of 
such an approach such as the overhead involved and the 
fragmentation of the processing resources in such systems 
will be studied in the future. The fragmentation of proces- 
sors in such systems is influenced by the network intercon- 
necting the processors in the system. We have preliminary 
analysis [KRI87] on the physical subset of processors that 
may comprise a partition in a partitionable system. We 
plan to investigate further these and related issues in the 
future. 


Acknowledgements: The authors would like to thank the 
Abstract


Experimentation aimed at determining the minimum
granularity at which variable-length SIMD operations
may be decoupled into identical asynchronous MIMD 
streams for a performance benefit is reported. The exper- 
imentation is based on timing measurements made on the 
PASM system prototype at Purdue. The application used 
to measure and evaluate this phenomenon was matrix 
multiplication, which has feasible solutions in both SIMD 
and MIMD modes of computation, as well as in a hybrid 
of SIMD and MIMD modes. Matrix multiplication was 
coded in these three ways and experiments were per- 
formed which examine the tradeoffs among all of these 
modes. 


1. Introduction 


While extensive past efforts have dealt with analyti- 
cal and simulated performance analysis of SIMD and 
MIMD algorithms, computations, and machines, this work 
describes empirically-based research generated from 
experiments on a parallel machine. This research was 
performed in an attempt to gain insight into the effect of 
certain aspects of novel architectures on applications pro- 
grams. Specifically, the performance of the PASM proto- 
type, a machine capable of both SIMD and MIMD modes 
of computation, is evaluated from the perspective of 
matrix multiplication. This application was chosen 
because it has obvious optimal solutions and a simple 
enough structure to permit analysis of architecture 
features through controlled measurements of program 
execution time. The experiments described are based on 
SIMD, MIMD, and hybrid S/MIMD algorithms for multi- 
plying n x n matrices for values of n ranging from 4 to 
256. Operations were performed on 16-bit integers utiliz- 
ing 16 processors in several 4, 8, and 16 processor 
configurations. 


The primary architecture feature being evaluated in 
this work is the ability to decouple small grains of vari- 
able execution-time operations from SIMD sections of 
code into multiple asynchronous MIMD threads of con- 
trol. This unique feature derives from the ability to 
dynamically reconfigure the parallelism mode of PASM. 


Results indicate that when mode-changing operations induce a minimal overhead, benefits of such decoupling may be found even for relatively small amounts of variation in the execution-time of individual operations. This same low-overhead mode-changing feature was also used to greatly improve the performance of the inter-process communication components of parallel programs by using the implicit hardware synchronization of SIMD mode to reduce the complexity of message passing protocols through the PASM interconnection network. Finally, experiments indicate that due to the existence of finite queues for issuing instructions from the control units to the processing elements in SIMD mode, superlinear speed-up is achievable. (We define superlinear speed-up as the condition in which the speed-up to number of PEs (processing elements) ratio is greater than 1.)


Section 2 briefly describes generally related work, and Section 3 overviews PASM and its prototype. Section 4 describes the basic algorithm that was used, while Section 5 describes the programmed variations of this algorithm as implemented on PASM for use in the experiments presented in Section 6. In Sections 7 through 11, the empirical results are discussed under special consideration of the PASM architecture as well as the central issue of decoupling variable-length SIMD operations into multiple asynchronous MIMD streams.


2. Background and Related Work


Related experimental research has been carried out on several machines through the use of both simulation and experimental techniques. Simulation-based analysis was performed by Su and Thakore for the SM3 system and a hypercube architecture [SuT87]. Experimental work involving measurements on working machines has also been performed. Examples include work involving several machines: the BBN Butterfly [CrG85], Cm* [GeS87], the Encore Multimax [Hud88], the Intel Hypercube [Hud88], PASM [FiC87], and the Warp system [AnA87]. In these efforts, matrix multiplication was normally employed as an example algorithm. Other reported work involving efficiency measurements and algorithm optimization on parallel machines includes work done on an Alliant FX/8 [JaM86, Han88], a CRAY XMP [Cal84], and a combination of Apollo work-stations and an Alliant FX/8 [KuN88].


3. Overview of PASM and the PASM Prototype


The PASM (partitionable SIMD/MIMD) system is a dynamically reconfigurable architecture in which the processors may be partitioned to form independent virtual SIMD and/or MIMD machines of various sizes [SiS81]. A 30-processor prototype has been constructed and was used in the experiments described in Section 6. This section discusses the PASM architecture characteristics which are most relevant to the reported experimentation. For a more general description of the architecture, see [SiS87].


The Parallel Computation Unit of PASM contains N 
PEs where N is a power of 2 (numbered from 0 to N—1), 
and an interconnection network. Each PE (processing 
element) is a processor/memory pair. The PE processors 
are sophisticated microprocessors that perform the actual 
SIMD and MIMD operations. The PE memory modules 
are used by the processors for data storage in SIMD mode 
and both data and instruction storage in MIMD mode. 
The Micro Controllers (MCs) are a set of $Q=2^q$ processors,
numbered from 0 to Q-1, which act as the control
units for the PEs in SIMD mode and orchestrate the 
activities of the PEs in MIMD mode. Each MC controls 
N/Q PEs. PASM has been designed for N=1024 and 
Q=32 (N=16 and Q=4 in the prototype). A set of MCs 
and their associated PEs form a virtual machine. In 
SIMD mode, each MC fetches instructions and common 
data from its associated memory module, executes the 
control flow instructions (e.g., branches), and broadcasts 
the data processing instructions to its PEs. In MIMD 
mode, each MC gets instructions and common data for 
coordinating its PEs from its memory. 


Figure 1: Simplified MC structure.


The PASM prototype system was built for N=16
and Q=4. This system employs Motorola MC68000 pro- 
cessors as PE and MC CPUs, with a clock speed of 8 
MHz. The interconnection network is a circuit-switched 
Extra-Stage Cube network, which is a fault-tolerant vari- 
ation of the multistage cube network. Because knowledge 
about the MC and the way in which SIMD instructions 
are implemented with standard MC68000 microprocessors 
is essential to the understanding of the behavior that was 
observed in the experiments, the SIMD instruction broad- 
cast mechanism is overviewed below. 


Consider the simplified MC structure shown in Fig- 
ure 1. The MC contains a memory module from which 
the MC CPU reads instructions and data. Whenever the 
MC needs to broadcast SIMD instructions to its associ- 
ated PEs, it first sets the Mask Register in the Fetch 
Unit, thereby determining which PEs will participate in 
the following instructions. It then writes a control word 
to the Fetch Unit Controller which specifies the location 
and size of a block of SIMD instructions in the Fetch Unit 
RAM. The Fetch Unit Controller automatically moves 
this block word by word into the Fetch Unit Queue. 
Whenever an instruction word is enqueued, the current 
value of the Mask Register is enqueued as well. Because 
the Fetch Unit enqueues blocks of SIMD instructions 
automatically, the MC CPU can proceed with other 
operations without waiting for all instructions to be 
enqueued. 
PEs execute SIMD instructions by performing an
instruction fetch from a reserved memory area called the 
SIMD instruction space. Whenever logic in the PEs 
detects an access to this area, a request for an SIMD 
instruction is sent to the Fetch Unit. Only after all PEs 
that are enabled for the current instruction have issued a 
request is the instruction released by the Fetch Unit 
queue, and the enabled PEs receive and execute the 
instruction. Disabled PEs do not participate in the 
instruction and wait until an instruction is broadcast for 
which they are enabled. This way, switching from MIMD 
to SIMD mode is reduced to executing a jump instruction 
to the reserved memory space, and a switch from SIMD 
to MIMD mode is performed by sending a jump to the 
appropriate PE MIMD instruction address located in the 
PE main memory space. 


The SIMD instruction broadcast mechanism can also 
be utilized for barrier synchronization [LuB80] of MIMD 
programs. Assume a program uses a single MC group, 
and requires the PEs to synchronize R times. First, the 
MC enables all its PEs by writing an appropriate mask to 
the Fetch Unit Mask Register. Then it instructs the 
Fetch Unit Controller to enqueue R arbitrary data words, 
and starts its PEs which begin to execute their MIMD 
program. If the PEs need to synchronize (e.g., before a 
network transfer), they issue a read instruction to access a 
location in the SIMD instruction space. Because the 
hardware in the PEs treats SIMD instruction fetches and 
data reads the same way, the PEs will be allowed to 
proceed only after all PEs have read from SIMD space. 
Thus, the PEs are synchronized. The R synchronizations 
require R data fetches from the SIMD space. Thus, the 
Fetch Unit Queue is empty when the MIMD program 
completes, and subsequent SIMD programs are not 
affected by this use of the SIMD instruction broadcast 
mechanism. 
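The release rule that makes the Fetch Unit Queue usable as a barrier can be illustrated with a toy model. The sketch below is not a description of the PASM hardware: the class `FetchUnitQueue` and its methods are hypothetical names, and the model captures only the behaviour stated above, namely that a queued word is released once every enabled PE has requested it.

```python
class FetchUnitQueue:
    """Toy model of the SIMD instruction queue used as a barrier."""

    def __init__(self, enabled_pes, words):
        self.enabled = set(enabled_pes)  # mask written by the MC
        self.words = list(words)         # R arbitrary words enqueued by the MC
        self.waiting = set()             # PEs blocked on the current word

    def request(self, pe):
        """PE reads from SIMD space; returns the PEs released by this read."""
        if pe not in self.enabled:
            raise ValueError("PE is not enabled for this word")
        if not self.words:
            raise RuntimeError("queue is empty: no barrier left")
        self.waiting.add(pe)
        if self.waiting != self.enabled:
            return set()                 # still blocked, others have not arrived
        self.words.pop(0)                # word consumed by all enabled PEs
        released, self.waiting = self.waiting, set()
        return released

    def pending(self):
        return len(self.words)


if __name__ == "__main__":
    queue = FetchUnitQueue(range(4), ["sync0", "sync1"])  # R = 2 barriers
    for barrier in range(2):
        for pe in (2, 0, 3, 1):          # arbitrary arrival order
            print(barrier, pe, sorted(queue.request(pe)))
    print(queue.pending())
```

After R barriers the modelled queue is empty, which mirrors the observation that subsequent SIMD programs are unaffected.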


In order to make comparisons of the speed of the 
PASM prototype relative to other machines and to com- 
pare the relative speeds of SIMD and MIMD instruction 
fetches, the actual raw performance of PASM in SIMD 
and MIMD mode was measured on the prototype and is 
illustrated in Table 1 in MIPS (millions of integer instruc- 
tions per second) for two different types of instructions. 
The difference in speed between SIMD and MIMD modes 
can be attributed to two factors. SIMD instructions are 
fetched from the Fetch Unit Queue in the MC, and the 
queue can deliver data with one less wait state than can 
the PEs’ main memories. In addition, PEs’ main 
memories are implemented with dynamic memories. 
While care was taken in the hardware design that all 
refresh operations occur simultaneously in all PEs, and 
are performed invisible to the PE CPU, some delay is still 
possible. No such delay occurs during SIMD instruction 
fetches because the Fetch Unit queue is implemented with 
static RAM components. Measurements were made with 
repeated blocks of straight line code which were large 
enough to make the loop control overlap insignificant. 


4. Matrix Multiplication Algorithms Used 


The parallel matrix multiplication algorithm used here had $O(n^3/p)$ time and space complexity for multiplying two $n \times n$ matrices employing $p$ PEs. Figure 2 shows an $O(n^3)$ time and space complexity serial algorithm. This particular algorithm is provided to illustrate the ordering of multiplications as they are done in the parallel version of Figure 3. Figure 4 demonstrates the progress of the serial algorithm for n=4. The two data-flow graphs illustrate what occurs during the first two iterations of the second j loop of Figure 3. The i loop of the serial algorithm simulates the PE number in the parallel algorithm. The calculation of ((i+j) mod n) in the serial version allows the rows of the B matrix to be stepped through as the j loop proceeds with the initial B matrix row number being i. The serial algorithm used in the measurements on PASM, however, was optimized in order to permit accurate evaluation of speed-up, and therefore did not perform multiplies in this columnar manner. Rather, it followed a more straightforward row-column order.


Table 1: Prototype raw performance.

Processing Mode    Operation                  Rate
SIMD               16-bit Reg.-to-Reg. add    22 MIPS
SIMD               16-bit Reg.-to-Mem. add    6.4 MIPS
MIMD               16-bit Reg.-to-Reg. add    18 MIPS
MIMD               16-bit Reg.-to-Mem. add    6.0 MIPS


for i=0 to n-1 do
    for j=0 to n-1 do
        c[j,i] = 0;
for i=0 to n-1 do
    for j=0 to n-1 do
        for k=0 to n-1 do
            c[k,i] = c[k,i] + a[k,((i+j) mod n)] * b[((i+j) mod n),i];


Figure 2: Serial matrix multiplication algorithm.
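The rotated ordering of Figure 2 can be checked against the conventional row-column product. The following is a minimal sketch in Python rather than the MC68000 assembly language used in the experiments; the function names are illustrative.

```python
import random


def rotated_serial_matmul(a, b):
    """C = A x B using the column-rotated ordering of the serial algorithm."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):                # i plays the role of the PE number
        for j in range(n):
            m = (i + j) % n           # row of B, starting at row i
            for k in range(n):
                c[k][i] += a[k][m] * b[m][i]
    return c


def reference_matmul(a, b):
    """Conventional row-column matrix product."""
    n = len(a)
    return [[sum(a[r][m] * b[m][col] for m in range(n)) for col in range(n)]
            for r in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    a = [[rng.randrange(100) for _ in range(n)] for _ in range(n)]
    b = [[rng.randrange(100) for _ in range(n)] for _ in range(n)]
    print(rotated_serial_matmul(a, b) == reference_matmul(a, b))
```

For a fixed column i, the index (i+j) mod n visits every row of B exactly once as j runs from 0 to n-1, so only the order of accumulation differs from the conventional product.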


for all i, 0 <= i <= p-1, do
    for v=0 to (n/p)-1 do
        π(v) = v;
        for j=0 to n-1 do
            c[j,v] = 0;
    for j=0 to n-1 do
        for v=0 to (n/p)-1 do
            for k=0 to n-1 do
                c[k,v] = c[k,v] + a[k,π(v)] * b[((i(n/p)+v+j) mod n),v];
        for v=1 to (n/p)-1 do
            [change the pointer to column v-1 of the A
             matrix to point to column v*]
            π(v-1) = π(v);
        for k=0 to n-1 do
            send a[k,π(0)] to PE (i-1) mod p;
            receive a value and move it into a[k,π((n/p)-1)];


Figure 3: Parallel matrix multiplication algorithm.


In the parallel algorithm, the outer for all loop represents iteration across space rather than time. Each PE contains n/p adjacent columns of each matrix as shown in Figure 5. Within each PE these columns are numbered from 0 to (n/p)-1 as shown in the algorithm of Figure 3. π is a vector of indices to the n/p columns of A (in the actual implementation address pointers are used for efficiency). This layout is similar to that used by Su and Thakore in their experiments for the SM3 system [SuT87]. Using the for v loop, each of these adjacent columns is stepped through by each PE in sequence, and each PE appears as if it has n/p virtual PEs within it. The virtual PE number is then defined as (n/p)i+v. Thus, the row subscript of B is calculated by replacing i in Figure 2 with this virtual PE number. Data movement internal to each PE involves only a pointer adjustment. Only on the boundaries of the A matrix (i.e., the highest and lowest numbered columns of each PE) is the inter-PE network employed.


* This effectively rotates all internal columns of the A matrix to the left without destroying the data in column 0 of the PE, or actually moving the data.


Figure 4: Two iterations of the serial algorithm for n=4.
(a) i=0, j=0, 0<=k<=3. (b) i=0, j=1, 0<=k<=3.


Figure 5: Data Layout for n=8, p=4.


This particular algorithm was chosen over a more standard parallel matrix multiplication algorithm (e.g., see Stone [Sto80]) for several reasons. First, if a broadcast approach is used to distribute the "a" coefficients to the PEs, p network set-up cycles are incurred in addition to $n^2$ network transfer cycles (in the course of the algorithm each PE will have to broadcast its A matrix values (hence p settings) and the whole A matrix will have to be broadcast (hence $n^2$ transfers)). In the chosen algorithm, the network remains in one configuration (i.e., PE i connected to PE (i-1) mod p), thus eliminating any recurring network set-up costs, while not incurring any additional network transfer costs. Also, this algorithm facilitates a columnar data format which was preferable for several reasons. First, because all matrices are stored in columnar format, BxA may be calculated as well as AxB without rearrangement of the data. Second, each matrix may be used in subsequent multiplications without reformatting. Data uniformity is also desirable to facilitate parallel I/O transfers of large data sets from secondary memory.


What follows is a semantic description of the progress of the algorithm. During each of the $n^2/p$ iterations of the innermost loop of the algorithm shown in Figure 3, each of the elements of the columns of the A matrix is multiplied by an element of the B matrix. (Note that due to the columnar storage, the column of the B matrix matches the internal column number of the A matrix. The absolute row number of B will match the absolute column number of the A matrix.) This value is then added to an element of the C matrix. Therefore, there is a total of n multiplications and additions in the inner loop with this loop being executed n/p times. In the final k loop, the columns of the A matrix are shifted one column to the left. Within each PE, this transfer involves a single memory move, because a pointer to the entire column is changed rather than moving its elements. However, for the lowest numbered column of each PE, the transfer employs the interconnection network. This column is transferred through the network and stored in the highest numbered column of PE ((i-1) mod p). The data received through the network is placed in the PE's memory as its highest numbered column. This transfer requires n network word operations (one for each element of the column). This procedure is repeated until all of the columns of the A matrix have been through each of the (n/p) positions of each PE for a total of $n^2$ network word transfer operation times incurred. During each of these elemental time periods, p values are exchanged.
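The column rotation described above can be simulated sequentially to confirm that it yields the full product. This is a sketch, not the PASM implementation: the PEs are modelled as list entries, the network transfer is modelled by handing each PE's lowest numbered column of A to PE (i-1) mod p, and all names are illustrative.

```python
import random


def simulate_parallel_matmul(a, b, p):
    """Simulate p PEs, each holding n/p adjacent columns of A, B and C."""
    n = len(a)
    if n % p != 0:
        raise ValueError("n must be divisible by p")
    w = n // p  # columns per PE

    def column(matrix, g):
        return [matrix[r][g] for r in range(n)]

    # a_cols[i][v] is the column of A currently held in slot v of PE i.
    a_cols = [[column(a, i * w + v) for v in range(w)] for i in range(p)]
    c = [[0] * n for _ in range(n)]

    for j in range(n):
        for i in range(p):              # "for all i": conceptually simultaneous
            for v in range(w):
                g = i * w + v           # virtual PE number = global column
                row = (g + j) % n       # row subscript of B
                for k in range(n):
                    c[k][g] += a_cols[i][v][k] * b[row][g]
        # Shift every column of A one position to the left. Column 0 of PE i
        # goes to the highest slot of PE (i-1) mod p, so PE i receives the
        # column sent by PE (i+1) mod p.
        outgoing = [a_cols[i][0] for i in range(p)]
        for i in range(p):
            a_cols[i] = a_cols[i][1:] + [outgoing[(i + 1) % p]]
    return c


def reference_matmul(a, b):
    n = len(a)
    return [[sum(a[r][m] * b[m][col] for m in range(n)) for col in range(n)]
            for r in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, p = 8, 4
    a = [[rng.randrange(100) for _ in range(n)] for _ in range(n)]
    b = [[rng.randrange(100) for _ in range(n)] for _ in range(n)]
    print(simulate_parallel_matmul(a, b, p) == reference_matmul(a, b))
```

At step j, slot v of PE i holds column ((n/p)i + v + j) mod n of A, which is exactly the row subscript of B used in the multiply, so each element of C accumulates all n products over the n steps.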


Consider the time required for index calculation. The constant i×(n/p) was pre-calculated and placed in the program's data segment because it was constant in each PE for a given value of n and p. Also, the j+v operation involved in the B matrix row calculation was done outside the k loop and therefore only contributes O(n) time complexity per PE. The calculation of the A and C matrix row indices was done with the MC68000's auto-increment mode. Due to the pipelined structure of the MC68000 this does not add any extra execution time to the non-autoincrement mode. Therefore, the index calculation, as a separate component of the execution time, is not significant.


The current implementation of the network in 
PASM supports 8-bit data transfers. Because these exper- 
iments involved 16-bit data, each element transfer 
required two shift operations (one for transmitting and 
one for receiving), an OR operation, and two network 
operations. Because no DMA block transfers were possi- 
ble given the current implementation of PASM, each 
column transfer required n single-element transfers for a 
total of 2n network operations per column. 
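The cost of moving 16-bit data over an 8-bit path can be made concrete with a short sketch. It assumes the high byte is sent first, which the text does not specify; the helper names are illustrative and do not correspond to actual PASM routines.

```python
def split_word(word):
    """Transmit side: one shift yields the two bytes of a 16-bit word."""
    high = (word >> 8) & 0xFF
    low = word & 0xFF
    return [high, low]            # two 8-bit network operations


def join_bytes(received):
    """Receive side: one shift and one OR rebuild the 16-bit word."""
    high, low = received
    return (high << 8) | low


def transfer_column(column):
    """Move a column element by element; returns (data, network operations)."""
    operations = 0
    result = []
    for word in column:
        packet = split_word(word)
        operations += len(packet)
        result.append(join_bytes(packet))
    return result, operations


if __name__ == "__main__":
    print(transfer_column([0x1234, 0x00FF, 0xFFFF, 0]))
```

A column of n elements therefore costs 2n network operations, matching the count given above.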


Because the PASM prototype network is circuit switched, setting up a path is a time-consuming operation. However, in this algorithm only a single path set-up is required (i.e., PE i always sends to PE (i-1) mod p). Thus the measurements made do not reflect any significant influence from network reconfiguration overhead. Hence, there were $2n^2$ network accesses, $n^3/p$ multiplications, and $n^3/p$ additions required. This resulted in an $O(n^3/p)$ growth in execution time.


5. Implementations of the Algorithm 


Three variations of the parallel algorithm, as well as 
an efficient serial version, were programmed in MC68000 
assembly language for execution on the PASM prototype. 
The parallel versions included a pure SIMD, a pure 
MIMD, and a hybrid S/MIMD version. These three pro-
grams may be executed on 4, 8, or 16 processors simply 
by changing variables embedded in their data sections. 


5.1. SIMD 


The SIMD version executes all looping and control 
flow instructions in the MCs. Arithmetic, data move- 
ment, and index calculation instructions are executed on 
the PEs in SIMD mode. The PE instruction stream is 
obtained through the MC’s Fetch Unit Queue and is exe- 
cuted synchronously on all PEs. 


In PASM, the network appears to the PEs as two 
memory locations (transmit and receive registers). Net- 
work transfers are made directly to the transfer registers 
using memory-to-memory move instructions. 


For several reasons, the SIMD version appeared to be 
the most natural choice for implementation. First, in the 
matrix multiplication algorithm used all PEs are always 
enabled, thus eliminating the need for enabling and disa- 
bling the PEs. Second, the implicit synchronization 


inherent in SIMD mode allowed the network transfer 


operations to be carried out in a straightforward fashion 
requiring only two memory-to-memory move instructions. 
Third, the only data-dependent portion of the algorithm 
is the actual multiplication instruction, which has a vari- 
able execution length due to its microcoded implementa- 
tion in the MC68000. A final advantage of the SIMD ver- 
sion is due to the use of a FIFO queue in the Fetch Unit 
of the MCs. Because this queue buffers instructions being 
sent to the PEs, the execution of SIMD instructions by 
the PEs can be overlapped with the execution of control 
flow instructions by the MCs. 


In addition to these conceptual factors involved in the SIMD version, there are some factors that were present due to the implementation of the PASM prototype. First, instructions may be accessed more quickly from the Fetch Unit Queue than from the PEs' main memory. This is due to the use of faster memory technology in the queue. Also, the overlap of the control flow instructions with PE instructions is only present if the queue remains non-empty. In other words, the PEs can only proceed if the MCs supply instructions faster than the PEs can remove them from the queue.


5.2. MIMD 


The second version was a pure MIMD program in 
which the MCs were only used for initiating the PE pro- 
grams. The PEs executed all instructions asynchronously 
including all network, control flow, and arithmetic opera- 
tions. Although the network hardware prevents overwrit- 


ing of old data in the transfer register, the asynchronous 


network operations necessitated polling of the network 
buffer in order to determine whether it was ready to 
accept new data. After transmission, the network buffer 
must be polled to assure that the data is valid before a 
receive operation can be completed. 


The major advantage of the MIMD version was rooted in the variation of the execution time of the MC68000 multiply instruction. Multiply or divide instructions require an amount of time which is related to the number of 1's in the binary representation of one operand. Assume an algorithm is executed on K PEs, each PE executes J instructions, and instruction j on PE k takes time $T_{j,k}$. Then the total execution time in SIMD mode ($T_{SIMD}$) is the sum of the worst case times for each instruction as given by:

$$T_{SIMD} = \sum_{j=1}^{J} \max_{0 \le k \le K-1} T_{j,k}$$

In MIMD mode each PE proceeds independently, and therefore the execution time ($T_{MIMD}$) is the worst case sum of instruction execution times as given by:

$$T_{MIMD} = \max_{0 \le k \le K-1} \sum_{j=1}^{J} T_{j,k}$$

In general, $T_{MIMD} \le T_{SIMD}$.
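The difference between the two expressions can be seen on a small example. The sketch below assumes a cycle model of 38 + 2 × (number of 1 bits in the 16-bit multiplier) for an unsigned multiply, which follows the published MC68000 timing for MULU but ignores effective-address time; the function names and the operand values are illustrative only.

```python
def mulu_cycles(multiplier):
    """Assumed cycle count of an unsigned 16-bit multiply (data dependent)."""
    return 38 + 2 * bin(multiplier & 0xFFFF).count("1")


def t_simd(times):
    """Sum over instructions of the slowest PE: PEs wait for each other."""
    return sum(max(row) for row in times)


def t_mimd(times):
    """Slowest PE overall: each PE runs its own instruction stream."""
    return max(sum(per_pe) for per_pe in zip(*times))


if __name__ == "__main__":
    # multipliers[j][k] is the multiplier used by instruction j on PE k.
    multipliers = [
        [0x0000, 0xFFFF],
        [0xFFFF, 0x0000],
        [0x00FF, 0x00FF],
    ]
    times = [[mulu_cycles(m) for m in row] for row in multipliers]
    print(times, t_simd(times), t_mimd(times))
```

In this example the slow multiplies fall on different PEs in different instructions, so SIMD mode pays the worst case every time while MIMD mode averages them out within each PE.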


5.3. S/MIMD 


The hybrid S/MIMD algorithm was developed to 
take advantage of the fast barrier synchronization 
mechanism described in Section 3 and to exploit the exe- 
cution time advantage of the MIMD program (i.e. decou- 
pling at low cost). In this version, the main program was 
the same as in the MIMD case. The difference was in the 
method of determining whether the network was ready to 
accept a transfer operation. Rather than polling the net- 
work buffer, barrier synchronization was used to allow 
network operations to be carried out as simple memory- 
to-memory move operations as in the SIMD version. This 
lowered the amount of network overhead to a level com- 
parable but slightly greater than the SIMD version due to 
the mode switching time. The other advantages of SIMD 
mode (i.e., faster instruction fetch and control flow 
instruction overlap) could not be realized in this version. 


6. Experiments Performed 


Experiments were performed on $n \times n$ matrices and 
measurements were made of the execution times for n = 
4, 8, 16, 64, 128, and 256. The algorithm was 
implemented for SIMD, MIMD, and S/MIMD mode and was 
run on p = 4, 8 and 16 PEs. All operations were 16-bit 
unsigned integer operations and overflow was ignored. To 
allow for varying machine and problem size, loops were 
utilized wherever possible. 


To measure the amount of asynchronous execution 
necessary to yield better performance by the hybrid 
version over the SIMD version, the number of multiplies in 
each innermost loop of the algorithm was varied as an 
experimental parameter. These multiplies were added as 
straight line code in order to prevent skewing of execution 
time data due to control flow overlap. The multiplies 
were added to study the effect on the total execution time 
and did not affect the values in the C matrix. Let $T_{SIMD}$ 
and $T_{S/MIMD}$ be the total execution time for the SIMD 
and S/MIMD programs respectively. The performance of 
each of the components of the execution time was 
measured at points corresponding to quantities of inner loop 
multiplications where: 


$T_{SIMD} < T_{S/MIMD}$, 
$T_{SIMD} = T_{S/MIMD}$, and 
$T_{SIMD} > T_{S/MIMD}$. 


Measurements were made with the internal system 
timers (MC68230). Experiments were performed for each 
version with the identity matrix in A and random data in 
B. While the value of the multiplier used in the MC68000 
multiplication instruction affects the execution time, the 
data value of the multiplicand has no effect. Therefore, 
the elements of the A matrix, which were always used as 
the multiplicand could be chosen as the identity matrix 
without affecting program performance. By using the 
identity matrix, matrix multiplication results could be 
easily verified, thereby simplifying the debugging process. 
Random data, produced from a uniformly distributed 
random number generator, was chosen for these 
experiments in order to represent the average case, and the 
same data sets were used on all versions of the algorithm 
with the same value of n and p. 


Figure 6: Execution time vs. problem size for p=8 
and one multiply per inner loop. 


7. Speed-up & Overall Comparison 


Figure 6 illustrates execution time of matrix multipli- 
cation vs. problem size observed in the parallel versions of 
the algorithm for p=8. The difference between the SISD 
time and that of the parallel versions represents an 
improvement by a factor of approximately p. 


Although not readily apparent in the graph, it 
should be noted that $T_{MIMD}/T_{S/MIMD}$ decreases as n 
increases. The only difference between these two versions 
is attributed to the contribution to the execution time of 
communication. Note that for p fixed, and small n (e.g. 
n=8), the time complexity of the multiplications is 
$n^3/p$, or $n^2 (n/8) = n^2$. This is the same order of 
contribution as communication. Hence, for small n, the $O(n^2)$ 
communication contribution dominates the $O(n^3)$ 
arithmetic. However, for larger n, the $O(n^3)$ component 
ultimately dominates and all three curves converge. 


The third aspect of this graph is the apparent 
advantage of the SIMD version over the S/MIMD version. The 
difference is caused by the ability of the MCs to execute 
control flow in parallel with arithmetic. However, the 
S/MIMD version has the potential for better performance 
due to the decoupling effect associated with MIMD 
execution of operations whose execution time is data dependent. 
To determine the point where these graphs cross, 
experiments were conducted which added more 
data-dependent instructions in a controlled way. 


8. Execution Time vs. Number of Variable Length 
Operations 


To determine the amount of asynchronous execution 
needed to achieve a benefit when executing a portion of a 
computation asynchronously in MIMD mode, additional 
multiplication operations were added to the innermost 
loop of the algorithm. Figure 7 plots total execution time 
for SIMD and S/MIMD programs with added 
multiplications vs. the number of added multiply instructions for 
n=64 and p=4 with random data. The lines plotted 
include 3 different points with the number of 
multiplications ranging from 13 to 15. 


Figure 7: Execution time vs. number of 
inner loop multiplications for 
n=64 and p=4. 


These lines are disjoint at the endpoints with the SIMD 
version being faster for small numbers of added multiplies 
and S/MIMD being faster as the number of added 
multiplies is increased. The point at which $T_{SIMD} = T_{S/MIMD}$ 
was with approximately fourteen added multiplications. 
This was due to the increase in execution efficiency when 
the multiplications were executed asynchronously, i.e., 
fewer processors were idle while waiting for all 
multiplications to complete. 


9. Contributions to Execution Time 


To further demonstrate that the execution time 
advantage was manifested in the multiplication 
instruction execution time, the contributions to the total 
execution time of the hybrid and SIMD programs were broken 
down and plotted. Figures 8, 9, and 10 contain plots of 
execution time vs. problem size at each of the endpoints 
and at the crossover point of Figure 7. 

The times shown are broken down into: (i) multiplication 
time, (ii) communication time, and (iii) other 
contributions such as time for clearing the C matrix and shifting 
pointers for internal data movement. Multiplication and 
communication times include related address calculation 
operations. The multiplication time also includes the 
addition operation required to add the calculated value to 
the proper C matrix element. Figure 8 shows clearly that 
as problem size increases the time required for the 
multiplications increases faster than the communication time. 
This was mainly due to the difference in the order of 
the communication time and the multiplication time (i.e. 
$O(n^2)$ vs. $O(n^3/p)$). Due to this difference in time 
complexity, the time required for the multiplication 
instructions becomes the largest component of execution time, 
even without the added multiplication instructions. The 
S/MIMD program, however, does not execute faster than 
the SIMD version due to both the control unit instruction 
overlap and the faster memory access time of the Fetch 
Unit Queue unless extra data-dependent instructions are 
added. 


Figure 8: Contributions to execution time for 
matrix multiplication with one multiply 
per inner loop and p=4. 

Figure 9: Contributions to execution time for 
matrix multiplication with 14 multiplies 
per inner loop and p=4. 

Figure 10: Contributions to execution time for 
matrix multiplication with 30 multiplies 
per inner loop and p=4. 


In Figure 9, the execution times are equal at n=64. 
With the total time broken down, it is apparent that the 
matrix multiplication times are close for all values of n, 
and when n=64 the matrix multiplication time is less in 
the S/MIMD program than in the SIMD program. 
However, the total execution time was the same because 
the communication time in the S/MIMD version was 
slightly more than in the SIMD version. Also, it should 
be noted that this effect would be greater if the constant 
value representing the instruction fetch time advantage 
were removed. 


Figure 10 demonstrates the advantage provided by 
the asynchronous multiplication instructions when enough 
were added to make the other effects diminish in impor- 
tance. In this version with 30 added multiplications per 
inner loop the S/MIMD version is faster for the larger 
values of n and this difference increases with n. 


10. Efficiency vs. Problem Size 


Figure 11 plots efficiency vs. problem size for the 
three modes of computation possible on PASM with p=4 
as well as the serial case where efficiency is defined as: 


Figure 11: Efficiency vs. problem size for p=4 and one 
multiply per inner loop. 


The efficiency of the S/MIMD and MIMD versions 
increased with the problem size, but never reached 
unity. The reason for the increasing efficiency 
can be accounted for by the fact that the quantity of 
communication overhead increases as $O(n^2)$, and the 
computation increases as $O(n^3/p)$. The best efficiency was 
96% for the S/MIMD version and 87% for the MIMD version 
(for n=256 and no added multiplies). The MIMD 
efficiency was lower due to the extra overhead required 
for the MIMD communication. 


The SIMD version, however, was not only more 
efficient than the MIMD and S/MIMD versions, but was 
able to achieve an efficiency greater than unity when 
compared only to the number of PEs employed. This 
difference can be attributed to the ability of the PEs to 
do computation while the MCs are doing looping and 
other control operations. If the queue can remain 
non-empty and non-full at all times, it should be possible to 
eliminate all of the time required for the control 
operations. Because this amount increases with n, it is not 
surprising that the benefit also increases with n. This 
amount of benefit is related to the ratio of control 
operations versus computation and communication 
operations. This does, however, demonstrate that the overlap 
of control flow and computation is possible and does 
provide some efficiency benefits, especially for applications 
that strongly exhibit a large quantity of control flow 
operations that can be performed on the MCs. This effect 
was predicted earlier by Kuehn et al. in [KuS86]. 


Figure 12: Efficiency vs. number of processors for n=64 
and one multiply per inner loop. 


11. Efficiency vs. Number of PEs 


Figure 12 shows how efficiency drops as the number 
of processors utilized increases. This drop in efficiency is 
due to several factors. First, the value of n/p drops as p 
increases representing a decrease in the amount of compu- 
tation done by each processor. While this does allow 
better parallelization of the algorithm, it makes the time 
consumed by inter-processor communication and other 
factors not present in the serial version become more 
significant compared to the time required by the compu- 
tation portion of the algorithm. 


12. Summary 


Experiments designed to examine the tradeoffs 
among the SISD, SIMD, MIMD, and barrier-synchronized 
MIMD modes on the PASM parallel processing 
system prototype were described. In particular, the effects of 
instructions with data-dependent execution times were 
considered. Tests were coded and executed on the 
prototype. Runtimes for different numbers of multiplies, 
numbers of processors, array sizes, and modes of 
parallelism were collected. This data was evaluated and 
discussed, analyzing the effects of the various parameters in 
the tests. 


The experiments presented used an actual parallel 
system and pointed out some of the trade-offs among 
these modes of parallelism. Experiments such as these on 
working prototypes are important in order to begin to 
learn how to effectively harness the power of parallel 
processing. 


ABSTRACT 


There exist two main schemes for data sharing among processing 
elements in multiprocessors: message passing in loosely coupled 
multiprocessors and shared memory in tightly coupled multiprocessors. 
However, the former has communication overhead and the latter has 
shared memory contention. In this paper, two Flexibly (Tightly/Loosely) 
Coupled Multiprocessors (FCMs) for image processing are proposed in 
order to alleviate these disadvantages. A variable space memory 
scheme in which a set of adjacent memory modules can be merged by 
a dynamically partitionable bus, is proposed to achieve the FCMs. 
These architectures are quantitatively analyzed and simulated on 
Intel's Personal SuperComputer (iPSC), a hypercube multiprocessor. 
Parallel algorithms for region labeling and median filtering are 
simulated on the proposed architectures. The performance of the FCMs 
shows remarkable improvement over the existing hypercube machine. 


1. INTRODUCTION 


In existing multiprocessors, there are mainly two methods for 
data sharing among processing elements (PEs). The first one is the 
message passing scheme in loosely coupled multiprocessors, and the 
second one is the shared memory scheme in tightly coupled 
multiprocessors. The Cosmic Cube [1], iPSC, and Ncube are examples 
of loosely coupled multiprocessors and the PUMPS [2], PASM [3], and 
Ultracomputer [4] are examples of tightly coupled multiprocessors. 


However, both types of multiprocessors have major limiting 
factors towards speed-up and expansion. Loosely coupled 
multiprocessors have a communication overhead disadvantage due to 
message passing, whereas tightly coupled multiprocessors have a shared 
memory contention problem. Another limiting factor which is usually 
neglected, but is important, is the time to load input data and to unload 
output data. In many studies, it is assumed that the data to be 
processed are already in the processing structures. In other words, the time 
to load and unload data is ignored. However, it is not negligible 
because input and output may take longer than computation time in 
some cases on existing multiprocessors. 


In order to alleviate these disadvantages (communication 
overhead, memory contention, and data loading and unloading 
overhead), and to achieve higher speed compared with that of existing 
multiprocessors, two Flexibly (Tightly/Loosely) Coupled Multiprocessors 
(FCMs) with a variable space memory scheme in which a set of 
memory modules can be merged by a dynamically partitionable bus are 
proposed. 


Most image processing tasks require a considerable amount of 
computation [5] which results from the large amount of data to be 
processed. The throughput of the system must be very high to meet 
these computational demands. Specifically, the throughput should be 
much higher for real-time applications where a sequence of image 
frames is required to be processed. 


*This research was supported in part by IBM. 


Parallel architectures for image processing can be classified into 
two groups in terms of functionality [6]: general purpose architectures 
and special purpose architectures. General purpose architectures are 
flexible and programmable for performing a broad range of 
applications. However, the desired performance often cannot be 
achieved due to the communication overhead, memory contention for 
the exchange of data and control information, and the complicated 
control strategies. Special purpose architectures can achieve better 
performance at the cost of flexibility and versatility. 


The characteristics inherent to image processing which warrant 
the parallel processing approach are now discussed. First, a whole 
image processing task can be decomposed into a set of subtasks which 
are sequentially applied to an entire image domain. For example, the 
task of object recognition is composed of several subtasks which 
include preprocessing, boundary detection, region labeling, 
normalization and finally matching. Second, a sequence of images is 
usually processed for real-time image processing. These 
temporal characteristics can be exploited by pipelining (temporal 
parallelism). Third, the entire image is subjected to the same operation 
which is performed pixel by pixel (e.g., histogram calculation) or 
region by region (e.g., median filtering). This spatial characteristic, i.e., 
the spatial locality of the image data, suggests that a whole image may 
be partitioned into subimages which can be processed in parallel by a 
set of PEs. The spatial characteristic can be exploited by array 
processing or multiprocessing (spatial parallelism). 


The rest of this paper is organized as follows. In the next section, 
two FCMs for image processing are introduced. These architectures are 
then quantitatively analyzed. The applications for SIMD (single 
instruction stream - multiple data stream) algorithms and pipelined 
pseudoparallel algorithms are described. The analytical model of the 
hypercube multiprocessor is described for comparison with the FCMs. 
In Section 3, the applications of the proposed architectures for image 
processing are described. Parallel algorithms for region labeling and 
median filtering are implemented and simulated on the hypercube 
multiprocessor. In Section 4, the experimental results and a discussion 
of the implementation and the simulation are presented for performance 
comparison. The FCMs are MIMD (multiple instruction stream - 
multiple data stream) machines which can exploit both types of 
parallelism for image processing. Some features of the FCMs proposed 
in this paper have several similarities to those of other architectures 
such as the VS [7], the SM3 [8] and the MP/C [9]. However, there 
exist several significant differences. Section 5 contains a discussion of 
the similarities and differences, and also some concluding remarks. 


2. TWO FLEXIBLY COUPLED MULTIPROCESSORS (FCMs) 


In this section, we describe the new architectures, and compare 
their features with those of the existing hypercube architecture. In 
general, the term tightly coupled is used for a multiprocessor which has 
shared memories, while the term loosely coupled is for a multiprocessor 
which has no shared memory. In contrast, as will be seen below, both 
terms may be used to describe the proposed architectures. 


2.1 The FCM Model I 


2.1.1 Architecture 


The Flexibly (Tightly/Loosely) Coupled Multiprocessor Model I 
(FCM I) is shown in Fig. 2.1.1. The FCM I consists of N PEs ($PE_i$), 
where $N = 2^n$, N memory modules ($M_i$), a control unit (CU) and a 
programmable I/O processor (IOP), where $0 \le i \le N-1$. $PE_i$ is 
connected to the CU through the communication bus and to $M_i$ 
through the dynamically partitionable address and data bus (A/D BUS) 
with the partitionable arbiter described below. $PE_i$ contains its own 
local memory for program and intermediate results. $M_i$ is used only for 
data. 


Fig. 2.1.1 The Flexibly Coupled Multiprocessor Model I (FCM I) 


A set of switches, $S_i$, which can connect and disconnect the A/D 
Bus, is located between two memory modules, $M_i$ and $M_{i+1}$. All 
switches in $S_i$ are operated together (closed or opened together). The 
CU can handle each switch set $S_i$ independently. The IOP is directly 
connected to this memory and the CU. 


Any two successive memory modules ($M_i$, $M_{i+1}$) can be formed 
into one contiguously addressable memory module with $S_i$ closed, in 
which case $PE_i$ and $PE_{i+1}$ can access the module through the arbiter 
one at a time. Moreover, a set of consecutive memory modules can 
become one contiguously addressable memory module with all switch 
sets between these modules closed. When all switch sets are closed, all 
memory modules become one contiguously addressable memory 
module. Thus, any PE, or the high speed IOP using the direct memory 
access (DMA) scheme, can access all memory modules. In this case, 
the FCM I becomes a fully tightly coupled multiprocessor. With all 
switch sets closed, if more than one PE tries to access the memory at 
the same time, memory contentions occur. To reduce memory 
contention, the CU opens some switch sets. If all switch sets are 
opened, then any PE ($PE_i$) can access its corresponding module ($M_i$) 
without memory contention. 


Accordingly, when all switch sets are open, the variable space 
memory scheme becomes N memory modules. Therefore, each PE can 
access only its own memory module. Thus, the FCM I becomes a 
fully loosely coupled multiprocessor because there is no more shared 
memory. When some switch sets are closed and the other sets are 
open, the network is grouped into a set of disjoint partitions. If all sets, 
except $S_i$, are open, $PE_i$ and $PE_{i+1}$ are referred to as partially tightly 
coupled, and other PEs are referred to as partially loosely coupled. The 
variable space memory becomes N-1 disjoint memory modules. 
Accordingly, the network forms N-1 partitions. When all switch sets, 
except $S_i$, are closed, the variable space memory becomes two disjoint 
memory modules, and the network forms two partitions. Each partition, 
$PE_j$, $0 \le j \le i$, and $PE_k$, $i+1 \le k \le N-1$, is partially tightly coupled. The 
partitions are partially loosely coupled with respect to each other. 
Thus, the FCM I is a flexibly coupled multiprocessor. 
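
The relation between switch-set states and partitions described above can be sketched in a few lines. This is an illustrative model only; the function name `partitions` and the list-of-booleans representation of the switch sets are assumptions, not part of the proposed hardware.

```python
from typing import List


def partitions(closed: List[bool]) -> List[List[int]]:
    """Group PE/module indices 0..N-1 into tightly coupled partitions.

    closed[i] is True when switch set S_i, located between M_i and
    M_{i+1}, is closed. There are N-1 switch sets for N modules.
    """
    groups = [[0]]
    for i, is_closed in enumerate(closed):
        if is_closed:
            groups[-1].append(i + 1)   # M_{i+1} merges with the current group
        else:
            groups.append([i + 1])     # an open switch set starts a new partition
    return groups


# N = 4: all closed (fully tightly coupled), all open (fully loosely
# coupled), only S_1 closed (N-1 partitions), only S_1 open (two partitions).
print(partitions([True, True, True]))
print(partitions([False, False, False]))
print(partitions([False, True, False]))
print(partitions([True, False, True]))
```

Each inner list is one contiguously addressable memory region together with the PEs that share it through the arbiter.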


To prevent memory contention in tightly coupled partitions, a 
partitionable arbiter within the A/D Bus is used between PEs and 
memory modules. One possible arbiter based on ripple carry logic is 
shown in Fig. 2.1.2 [10]. This arbiter can be modified to be a 
partitionable arbiter by using switches as shown. The lines $G_i$ are for 
requests, $P_i$ are for propagation of the requests from the previous 
stages, and $C_i$ signal whether the request has been granted in the 
previous stages. The switches in the arbiter are operated in conjunction 
with the switch sets in the A/D Bus. 


[Figure labels: switches $S_0$ (closed), $S_1$ (open), ..., $S_{N-2}$; arbiter lines $C_0 \ldots C_{N-1}$, $G_0 \ldots G_{N-1}$, $P_0 \ldots P_{N-1}$] 


Fig. 2.1.2 The Partitionable Arbiter 


2.1.2 The Variable Space Memory Scheme 


For a more detailed observation, the connection scheme of PEs 
($PE_i$, $PE_{i+1}$), memory modules ($M_i$, $M_{i+1}$), and a set of switches ($S_i$) is 
shown in Fig. 2.1.3. $PE_i$ and $PE_{i+1}$ are connected to $M_i$ and $M_{i+1}$ 
respectively, via the partitionable A/D Bus. $S_i$ is located between $M_i$ 
and $M_{i+1}$. 

The address space of each memory module is 

$$K = 2^m \qquad (2.1.1).$$


Each module $M_i$, for $0 \le i \le N-1$, contains consecutive addresses from 
$i \cdot 2^m$ to $(i+1) 2^m - 1$. Thus, m address lines ($a_0, \ldots, a_{m-1}$) are required for 
a module. Since there are N (= $2^n$) memory modules, the total address 
space is expressed by 


$$N \cdot K = 2^{(m+n)} \qquad (2.1.2).$$


The total number of address lines is m+n, namely, ($a_0, \ldots, a_{m+n-1}$). 
Thus, every item has a unique and absolute address so that it can be 
accessed by any PE or the IOP through the A/D Bus. The number of 
switches in a set is m+n+d+a, where d is the number of switches 
for the data bus, and a is the number of switches for the partitionable 
arbiter. The switches in a set can be operated in parallel by using the 
control bus, but any two sets may not be operated in parallel. 
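
The addressing scheme above implies that the upper n bits of an (m+n)-bit address select the module and the lower m bits select the word within it. A minimal sketch follows; the function names and the example values of m and n are hypothetical.

```python
from typing import Tuple


def decode(address: int, m: int, n: int) -> Tuple[int, int]:
    """Split an absolute address into (module index i, offset within M_i).

    Each module holds K = 2**m words and there are N = 2**n modules.
    """
    if not 0 <= address < (1 << (m + n)):
        raise ValueError("address outside the 2**(m+n) address space")
    return address >> m, address & ((1 << m) - 1)


def module_range(i: int, m: int) -> Tuple[int, int]:
    """First and last absolute address held by module M_i."""
    return i << m, ((i + 1) << m) - 1


# Hypothetical sizes: m = 12 (4096 words per module), n = 2 (4 modules).
print(decode(0x1234, m=12, n=2))   # module index and offset of address 0x1234
print(module_range(1, m=12))       # address span of M_1
```

Because every word has one absolute address, the decode is the same whether the switch sets are open or closed; the switch states only determine which PEs can reach a given module.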


Fig. 2.1.3 The Connection Scheme of PEs and Memory Modules 


In multistage interconnection networks, $\frac{N}{2} \log_2 N$ switching 
elements (2 × 2) are required, while only N-1 switching sets are 
required in the FCM I. The switches in the set $S_i$ are simple switches, 
unlike the complicated switching elements with relatively complicated 
control strategy used in a multistage interconnection network [11]. 
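
The two counts can be tabulated for a few machine sizes. This is a simple arithmetic sketch with hypothetical function names; note that it compares 2 × 2 switching elements against switch sets, and that each FCM I switch set itself contains m+n+d+a simple switches.

```python
def multistage_switching_elements(num_pes: int) -> int:
    """(N/2) * log2(N) two-by-two elements, for N a power of two."""
    if num_pes < 2 or num_pes & (num_pes - 1):
        raise ValueError("N must be a power of two and at least 2")
    stages = num_pes.bit_length() - 1   # log2(N) for a power of two
    return (num_pes // 2) * stages


def fcm1_switch_sets(num_pes: int) -> int:
    """One switch set between each pair of adjacent memory modules."""
    return num_pes - 1


for n_pes in (4, 16, 64, 1024):
    print(n_pes, multistage_switching_elements(n_pes), fcm1_switch_sets(n_pes))
```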


2.1.3 Communications 


There are three different types of communication: CU-to-PE 
communication (bi-directional), PE-to-PE communication and broadcast 
from the CU or from any PE. CU-to-PE communication can be 
achieved via the communication bus. PE-to-PE communication can be 
achieved via the communication bus or the variable space memory. 
The broadcast from the CU or from any PE can be realized by the 
communication bus in one cycle. For data sharing, the variable space 
memory can be used in tightly coupled partitions, and the 
communication bus can be used in loosely coupled partitions. Even in 
tightly coupled partitions, the communication bus can be used for 
message passing where messages are short (a few bytes). 


The variable space memory is advantageous for exchange of large 
amounts of data, while the communication bus is advantageous for 
exchange of small amounts of data. In order to combine the data in $M_i$ 
and $M_{i+1}$, only the switching time to close the switch set $S_i$ is needed 
instead of message passing. 


2.1.4 Control 


The control scheme of the FCM I is simpler than the control of a 
multiprocessor which uses a multistage interconnection network. The 
role of the CU is to control PEs, all switch sets and the IOP. 


An example of the control schemes for SIMD mode processing 
on the FCM I, configured as an MIMD machine, is described below. 


i)    The CU closes all switch sets and sends a signal to the IOP to 
      load input data into a set of memory modules. 

ii)   The IOP can load input data by treating the collection of memory 
      modules as a single module. After the IOP loads data, it sends 
      the load completion signal to the CU. 

iii)  The CU opens all switches and broadcasts the task start signal to 
      the PEs employed to execute their tasks. 

iv)   During execution, data sharing is achieved via the variable space 
      memory and also via the communication bus. The CU mediates 
      data sharing when the variable space memory is used. 

v)    After each PE finishes its task, it sends the task completion signal 
      to the CU. 

vi)   When the CU receives all task completion signals from the PEs 
      employed, it closes all switch sets and sends a signal to the IOP 
      to unload output data. 


As described above, PE-to-PE communications and CU-to-PE 
communications are not required to load and unload data, which is in 
sharp contrast to the hypercube machine described in 2.3.2. Instead, 
only the switching time and control time are required for the FCM 
machines. The CU-to-PE communication is used for control, not for 
data movement. In other words, the communications of the CU mainly 
consist of the control signals for PEs, switches, and the IOP. The 
control and synchronization schemes are simple. The CU issues control 
signals infrequently. Hence, the CU and the communication bus will 
not cause bottlenecks. This will be discussed in greater detail in 
subsection 3.1.2 and Section 4. Since the size of input and output data 
can be predicted in image processing tasks, the IOP can load and 
unload data selectively. For more general tasks, the terminator which 
indicates the end of input data or output data can be used by the IOP 
and PEs. This type of procedure is well suited for SIMD algorithms, 
such as median filtering, convolution, edge detection, FFT (fast Fourier 
transform), region labeling, and so on. 
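
The six control steps above can be sketched as a small sequencing model. This is a toy illustration under stated assumptions: the class `ControlUnit`, its method names, and the log strings are hypothetical, the PEs run one after another rather than concurrently, and the data sharing of step iv is not modeled.

```python
from typing import Callable, List


class ControlUnit:
    """Toy model of the CU sequencing one frame on an N-PE FCM I."""

    def __init__(self, num_pes: int) -> None:
        self.num_pes = num_pes
        self.closed = [False] * (num_pes - 1)   # switch sets S_0 .. S_{N-2}
        self.log: List[str] = []

    def set_all_switches(self, closed: bool) -> None:
        self.closed = [closed] * (self.num_pes - 1)
        self.log.append("close all" if closed else "open all")

    def iop_transfer(self, direction: str) -> None:
        # The IOP treats the memory as a single module, so every switch
        # set must be closed before it loads or unloads.
        if not all(self.closed):
            raise RuntimeError("IOP transfer requires all switch sets closed")
        self.log.append(f"IOP {direction}")

    def process_frame(self, task: Callable[[int], int]) -> List[int]:
        self.set_all_switches(True)     # step i: close switches, signal IOP
        self.iop_transfer("load")       # step ii: IOP loads, reports completion
        self.set_all_switches(False)    # step iii: open switches, start tasks
        # steps iv and v: each PE executes its task and reports completion
        results = [task(pe) for pe in range(self.num_pes)]
        self.set_all_switches(True)     # step vi: close switches, unload
        self.iop_transfer("unload")
        return results


cu = ControlUnit(num_pes=4)
print(cu.process_frame(lambda pe: pe * pe))
print(cu.log)
```

The log makes the point of the paragraph above visible: loading and unloading involve only switch operations and control signals, with no PE-to-PE data movement.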


2.1.5 Analytical Model for Image Processing 


In the FCM I, the parallel image processing time mainly consists 
of six time components: input data acquisition time ($t_{ac}$) to digitize the 
image data from an input device, such as a camera, and to store the 
digitized image into the variable space memory by the IOP, 
computation time ($t_{cp}$), merging time to take the boundary consistency 
problem between subimages into account ($t_{mg}$), switching time to close 
and open switches ($t_{sw}$), control time to exchange control signals ($t_{cn}$), 
and an output transfer time ($t_y$) to unload data to an output device. The 
parallel processing overhead includes $t_{mg}$, $t_{sw}$, and $t_{cn}$. 


$t_{cp}$ and $t_{mg}$ are functions of the size of the image (I) and the 
number of the PEs employed (N). $t_{ac}$ and $t_y$ are functions of I and are 
independent of N. $t_{sw}$ and $t_{cn}$ are functions of N and are independent of 
I. The time components may be expressed as follows 


$$\begin{aligned}
t_{ac} &= f_1(I), && (2.1.3.a) \\
t_{cp} &= f_2(I, N), && (2.1.3.b) \\
t_{mg} &= f_3(I, N), && (2.1.3.c) \\
t_{sw} &= f_4(N), && (2.1.3.d) \\
t_{cn} &= f_5(N), && (2.1.3.e) \\
t_y &= f_6(I). && (2.1.3.f)
\end{aligned}$$


The functions $f_2$, $f_3$, $f_4$ and $f_5$ are very dependent on algorithms. 


As discussed in the previous subsection, PE-to-PE 
communications are not needed for loading and unloading data. Only 
CU-to-PE communications are required for control. 


Thus, the total time to process an image frame on the FCM I can 
be expressed by 


$$T_I = t_{ac} + t_{cp} + t_{mg} + t_{sw} + t_{cn} + t_y \qquad (2.1.4).$$

On a single PE, it may be represented by 

$$T_1 = t_{ac} + cN t_{cp} + t_y \qquad (2.1.5)$$

where c is a constant. 


Speed-up ($SP_I$) is defined as the ratio of the time on a single PE 
to that on the FCM I 


$$SP_I = \frac{t_{ac} + cN t_{cp} + t_y}{t_{ac} + t_{cp} + t_{mg} + t_{sw} + t_{cn} + t_y} \qquad (2.1.6).$$
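
Equations (2.1.4) to (2.1.6) can be evaluated directly once the six time components are known. The sketch below is illustrative only: the class and function names are hypothetical, `t_y` stands for the output transfer time, and the numeric values are made up rather than taken from the paper's measurements.

```python
from dataclasses import dataclass


@dataclass
class FrameTimes:
    t_ac: float  # input data acquisition time
    t_cp: float  # computation time per PE on the FCM I
    t_mg: float  # merging time
    t_sw: float  # switching time
    t_cn: float  # control time
    t_y: float   # output transfer time


def total_time_fcm1(t: FrameTimes) -> float:
    """Equation (2.1.4): total time for one frame on the FCM I."""
    return t.t_ac + t.t_cp + t.t_mg + t.t_sw + t.t_cn + t.t_y


def single_pe_time(t: FrameTimes, num_pes: int, c: float = 1.0) -> float:
    """Equation (2.1.5): time on a single PE, with c a constant."""
    return t.t_ac + c * num_pes * t.t_cp + t.t_y


def speedup_fcm1(t: FrameTimes, num_pes: int, c: float = 1.0) -> float:
    """Equation (2.1.6): ratio of single-PE time to FCM I time."""
    return single_pe_time(t, num_pes, c) / total_time_fcm1(t)


# Hypothetical component times in arbitrary units.
times = FrameTimes(t_ac=10.0, t_cp=20.0, t_mg=2.0, t_sw=1.0, t_cn=1.0, t_y=10.0)
print(total_time_fcm1(times), single_pe_time(times, 8), speedup_fcm1(times, 8))
```

With these made-up values the speed-up on 8 PEs is well below 8, because the acquisition and output transfer times appear unchanged in both numerator and denominator.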


2.2 The FCM Model II with Buffering Capability 


This model is useful for MSIMD (multiple single instruction 
stream - multiple data stream) mode processing, which is composed of 
a set of SIMD mode processes. It is also useful for pipelined 
pseudoparallel algorithms [12] discussed in 2.2.3 as well as for SIMD 
algorithms. 


2.2.1 Architecture 


The FCM with buffering capability (FCM II) is shown in Fig. 
2.2.1. There are two memory sets, $M_A$ and $M_B$. The set $M_A$ consists of 
N memory modules {$M_{A0}, M_{A1}, \ldots, M_{A,N-1}$} and the set $M_B$ consists 
of N memory modules {$M_{B0}, M_{B1}, \ldots, M_{B,N-1}$}. The local memory 
in a PE can be used for storing program and intermediate results as in 
the FCM I. The IOP is connected to each memory set, with two ports 
for each set to load input data and unload output data from the 
modules, independently. 


Fig. 2.2.1 The Flexibly Coupled Multiprocessor Model II (FCM II) 


While PEs use $M_A$ for processing, the IOP can unload output 
results from and load input data into $M_B$. When PEs start processing 
$M_B$, the IOP can unload output data from and load input data into $M_A$. 


The role of the CU in the FCM II is similar to that in the FCM I. 
The CU has three functions: to control the upper and lower switch sets 
separately, to control PEs, and to control the IOP. 


2.2.2 Analytical Model for Image Processing 


While PEs use $M_A$, the CU sends a signal to the IOP to unload 
output data from $M_B$ and load another input data into $M_B$, and vice 
versa. Thus $t_{ac}$, $t_y$, and some portions of $t_{sw}$ and $t_{cn}$ for one set of data 
can be overlapped with $t_{cp}$, $t_{mg}$, $t_{sw}$ and $t_{cn}$ for the other set of data. 
For example, during processing in $M_A$, the time for sending a signal to 
the IOP from the CU can be overlapped with computation time. 
Therefore, $t_{ac} + t_y + p_1 t_{sw} + p_2 t_{cn}$ can be overlapped with 
$t_{cp} + t_{mg} + (1 - p_1) t_{sw} + (1 - p_2) t_{cn}$, where $p_1$ and $p_2$ are the overlapped 
portions of $t_{sw}$ and $t_{cn}$. 


The total time to process an image frame for the FCM II ($T_{II}$) is 
expressed as follows 


$$T_{II} = \begin{cases}
t_{cp} + t_{mg} + (1 - p_1) t_{sw} + (1 - p_2) t_{cn}, & \text{if } t_{cp} + t_{mg} + (1 - p_1) t_{sw} + (1 - p_2) t_{cn} \ge t_{ac} + t_y + p_1 t_{sw} + p_2 t_{cn} \\
t_{ac} + t_y + p_1 t_{sw} + p_2 t_{cn}, & \text{otherwise}
\end{cases} \qquad (2.2.1).$$


Speed-up ($SP_{II}$) is expressed as follows 


$$SP_{II} = \frac{T_1}{T_{II}} \qquad (2.2.2).$$
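
Equation (2.2.1) amounts to taking the larger of the processing side and the input/output side of the overlap. The following sketch assumes that reading; the function name and all numeric values are hypothetical.

```python
def total_time_fcm2(t_ac: float, t_cp: float, t_mg: float, t_sw: float,
                    t_cn: float, t_y: float, p1: float, p2: float) -> float:
    """Frame time on the FCM II when I/O for one memory set overlaps
    processing in the other; p1 and p2 are the overlapped portions of
    the switching and control times."""
    processing_side = t_cp + t_mg + (1 - p1) * t_sw + (1 - p2) * t_cn
    io_side = t_ac + t_y + p1 * t_sw + p2 * t_cn
    return max(processing_side, io_side)


# Hypothetical component times in arbitrary units.
compute_bound = total_time_fcm2(t_ac=10.0, t_cp=20.0, t_mg=2.0, t_sw=1.0,
                                t_cn=1.0, t_y=10.0, p1=0.5, p2=0.5)
io_bound = total_time_fcm2(t_ac=10.0, t_cp=5.0, t_mg=2.0, t_sw=1.0,
                           t_cn=1.0, t_y=10.0, p1=0.5, p2=0.5)
print(compute_bound, io_bound)
```

In the second call the computation is short enough that loading and unloading set the frame time, which is the situation the buffering scheme cannot improve further.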


2.2.3 Pipelined Pseudoparallel Algorithms on the FCM II 


Pipelined pseudoparallelism, where a serial algorithm is 
decomposed into a set of noninteractive independent subtasks so that 
parallelism can be used in each subtask level, is proposed in [12]. 


For an integrated computer vision system, pipelined 
pseudoparallelism is valuable, wherein temporal parallelism and spatial 
parallelism can be exploited. Spatial parallelism and temporal 
parallelism may be achieved by using multiprocessing and pipelining, 
respectively. Many computer vision tasks, such as pattern recognition 
and dynamic scene analysis, fall in the category of pipelined 
pseudoparallel algorithms. 


Pipelined pseudoparallel algorithms can be implemented 
conveniently on the FCM II. The FCM II can be partitioned into a set 
of various-size MIMD machines (or SISD, single instruction stream - 
single data stream, if the size is 1) by using the partitionable A/D Bus 
as described below. 


For the sake of convenience, the FCM II with N = 4 is illustrated 
in Fig. 2.2.2. $S_{Ai}$ and $S_{Bi}$ denote that the switches are closed, and $\bar{S}_{Ai}$ 
and $\bar{S}_{Bi}$ denote the switches as being open. Suppose that a whole task g 
is decomposed into three subtasks ($g_1$, $g_2$, $g_3$), and $g_1$ needs two PEs 
($PE_0$, $PE_1$) and each of $g_2$ and $g_3$ needs one PE ($PE_2$ and $PE_3$). For 
example, a pattern recognition task (g) consists of three subtasks which 
are preprocessing ($g_1$), feature extraction ($g_2$) and pattern classification 
($g_3$). Assume that a sequence of images is to be processed. For 
simplicity, the steps for switching and control between the CU, PEs and 
the IOP are omitted. 


Fig. 2.2.2 The FCM II ($N = 4$)


In the first phase, $PE_0$ and $PE_1$ ($g_1$) process the image frame
loaded by the IOP in $M_{A0}$ and $M_{A1}$ with $S_{A0}$ and $\bar{S}_{A1}$. After processing,
$PE_0$ and $PE_1$ write output into $M_{A0}$ and $M_{A1}$. In the second phase, $PE_2$
($g_2$) processes the intermediate results in $M_{A0}$ and $M_{A1}$ with $S_{A0}$ and $S_{A1}$,
and writes output into $M_{A2}$. In the third phase, $PE_3$ ($g_3$) processes the
intermediate results in $M_{A2}$ with $S_{A2}$, and writes output into $M_{A3}$.
Finally, the IOP accesses output results from $M_{A3}$. In other words, $g_i$
reads input data from $M_{A(i-1)}$ ($M_{B(i-1)}$), and writes output data into
$M_{Ai}$ ($M_{Bi}$). When $i=0$, input data is loaded from the IOP. The above
steps are interleaved between two memory sets as described below.


i) While $PE_0$ and $PE_1$ ($g_1$) process the image frame loaded by the
IOP in $M_{A0}$ and $M_{A1}$ with $S_{A0}$, $PE_2$ ($g_2$) processes the intermediate
results generated by $PE_0$ and $PE_1$ in $M_{B0}$ and $M_{B1}$ with $S_{B0}$ and
$S_{B1}$. Also, $PE_3$ ($g_3$) processes the intermediate results generated
by $PE_2$ in $M_{A2}$ with $S_{A2}$. After processing, the processed frames of
$g_1$, $g_2$ and $g_3$ are in ($M_{A0}$ and $M_{A1}$), $M_{B2}$ and $M_{A3}$, respectively.
The IOP can directly access the final results in $M_{A3}$ with $S_{A2}$.


ii) Similarly, while $PE_0$ and $PE_1$ ($g_1$) process the image frame in
$M_{B0}$ and $M_{B1}$, $PE_2$ ($g_2$) accesses the previous output of $g_1$ in $M_{A0}$
and $M_{A1}$, and writes its output in $M_{A2}$. At the same time, $PE_3$ ($g_3$)
accesses the previous output of $g_2$ in $M_{B2}$. The IOP can directly
access the final results in $M_{B3}$ with $S_{B2}$.


As described above, PEs can process the data in two memory sets,
$M_A$ and $M_B$, alternately. The PEs of the current subtask write their
outputs into the corresponding memory modules, which are used by the PEs
of the next subtask. Thus, even the data transfer time between stages,
which causes overhead in pipelined schemes, can be reduced. During
processing, there are only two connection patterns between PEs and the
variable space memory modules, shown in Fig. 2.2.3. There is no
communication overhead. For synchronization, the stage transition
between subtasks must be controlled by the CU. The processing time
per output is $\max\{t_i\}$, where $t_i$ is the processing time of the $i$-th stage.
Any composition of subtasks in pipelined pseudoparallel algorithms
may be mapped into the FCM II without communication overhead.
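The alternation between the two memory sets and the resulting output rate can be sketched in a few lines of Python. This is an illustrative model only: it assumes that all stages advance in lock step under CU control, so that each pipeline step lasts as long as the slowest stage, and the function names and stage times are hypothetical.

```python
def memory_set(stage, step):
    """Memory set ('A' or 'B') a stage reads from at a given pipeline step.

    Stages and steps are 0-based; stage 0 corresponds to g1. Neighbouring
    stages always work in opposite sets, and each stage swaps sets on
    every step, as in steps i) and ii).
    """
    return "A" if (stage + step) % 2 == 0 else "B"


def output_times(stage_times, n_frames):
    """Time at which each frame leaves the last stage.

    Assumes a synchronous pipeline: every step takes max(stage_times).
    """
    step_time = max(stage_times)
    n_stages = len(stage_times)
    # Frame f enters stage 0 at step f and leaves the last stage
    # at the end of step f + n_stages - 1.
    return [(frame + n_stages) * step_time for frame in range(n_frames)]


# Hypothetical stage times in msec for g1, g2, g3.
times = output_times([30, 50, 20], n_frames=4)
```

After the pipeline has filled, consecutive entries of `times` differ by 50 msec, the time of the slowest stage, which is the per-output processing time stated in the text.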


Fig. 2.2.3 Two Connection Patterns between PEs and Memory Modules 


2.3 Hypercube Multiprocessor


In order to compare the performance of the FCMs with a 
hypercube multiprocessor, the architecture and the analytical model of a 
hypercube are briefly discussed. 


2.3.1 Architecture 


The system is composed of a controller and a hypercube structure
which consists of $N = 2^n$ PEs. Each processor is connected to its $n$
nearest neighboring PEs. The hypercube multiprocessor is a loosely
coupled multiprocessor with no shared memory and no global
synchronization. Thus, data sharing between PEs is achieved by
message passing [1].

There are two different types of communication: controller-to-PE
communication and PE-to-PE communication. The communication
required to perform an image analysis task causes significant overhead
which degrades performance [13].


2.3.2 Analytical Model for Image Processing 


In the hypercube multiprocessor, the parallel image processing
time mainly consists of six time components: input data acquisition
time ($t_{ac}$), input data distribution time ($t_{ds}$) from the controller to PEs,
which may consist of controller-to-PE communications ($t_{ds_c}$) and
PE-to-PE communications ($t_{ds_p}$), computation time ($t_{cp}$), collection time to
gather local results in PEs by using PE-to-PE communications ($t_{cl_p}$),
merging time to take the boundary consistency between subimages into
account ($t_{mg}$), collection time to send a whole result to the controller by
using PE-to-controller communication ($t_{cl_c}$), and output transfer time
($t_{tf}$). The parallel processing overhead includes $t_{ds_c}$, $t_{ds_p}$, $t_{mg}$, $t_{cl_p}$ and
$t_{cl_c}$.

Each time component, except $t_{ac}$ and $t_{tf}$, is a function of both the
size of the image ($I$) and the number of the PEs employed ($N$). $t_{ac}$ and
$t_{tf}$ are functions only of $I$. Therefore, the time components are
expressed as follows




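$$t_{ac} = h_1(I), \qquad (2.3.1.a)$$

$$t_{ds} = t_{ds_c} + t_{ds_p} = h_2(I, N), \qquad (2.3.1.b)$$

$$t_{cp} = h_3(I, N), \qquad (2.3.1.c)$$

$$t_{cl} = t_{cl_p} + t_{cl_c} = h_4(I, N), \qquad (2.3.1.d)$$

$$t_{mg} = h_5(I, N), \qquad (2.3.1.e)$$

$$t_{tf} = h_6(I). \qquad (2.3.1.f)$$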


It was found that the functions can be represented as
$h_i = C_{i0} + C_{i1} I_s + C_{i2} I_s^2$, where the $C_{ij}$'s are constants, $I_s$ is the size of a
subimage and $2 \le i \le 4$ [14]. These functions are very dependent
on algorithms.


The total time to process an image frame on the hypercube
multiprocessor ($T_h$) is represented by

$$T_h = t_{ac} + t_{ds} + t_{cp} + t_{cl} + t_{mg} + t_{tf} \qquad (2.3.2).$$

Speed-up ($SP_h$) is defined as the ratio of the time on a single PE
to that on the hypercube multiprocessor

$$SP_h = \frac{T_s}{T_h} = \frac{t_{ac} + N t_{cp} + t_{tf}}{t_{ac} + t_{ds} + t_{cp} + t_{cl} + t_{mg} + t_{tf}} \qquad (2.3.3).$$


The time to load data into and unload data from the hypercube
includes $t_{ac}$, $t_{ds}$, $t_{cl}$ and $t_{tf}$. $t_{ds}$ and $t_{cl}$ consist of controller-to-PE
communications and PE-to-PE communications which may significantly
degrade performance. As $N$ increases, communication overhead
increases, and efficiency sharply decreases. In contrast, to load and
unload data on the FCM I, only $t_{ac}$, $t_{tf}$ and some portions of $t_{sw}$ and $t_{cn}$
are needed. Furthermore, to load and unload data on the FCM II, even
$t_{ac}$ and $t_{tf}$ can be entirely overlapped, and hence only some portions of
$t_{sw}$ and $t_{cn}$ are required.


3. APPLICATIONS TO IMAGE PROCESSING 


We describe the simulations of some common image processing 
algorithms on the proposed architectures, and compare them with the 
implementation of these on the hypercube machine. We illustrate the 
advantages of the proposed models over the hypercube model for such 
algorithms. In the next section, we provide experimental results that 
demonstrate the advantages of the new architectures. 


Region labeling and median filtering are chosen for performance 
evaluation. Median filtering is used for preprocessing in many image 
processing tasks. Region labeling is one of the basic operations in 
image processing. Once an image has been partitioned into regions, 
these regions can be studied, described and possibly identified. 
Applications where region labeling plays an integral part include cell 
classification, military target detection, parts inspection, object 
classification and character recognition. 


3.1 Region Labeling 


A parallel algorithm for region labeling has been implemented on 
the iPSC d5 (32 PEs) and is described elsewhere [13]. The tracking 
method of Agrawala and Kulkarni [15] has been modified to overcome 
some of the limitations of the original scheme, and the merging 
algorithms for parallel implementation have been developed [13]. 


Two major limiting factors for speed-up on the hypercube 
multiprocessor were discovered. One is the PE-to-PE communication 
overhead, the other is the controller-to-PE communication overhead. 
The same algorithm can also be implemented on the FCMs without
communication overhead. Only switching time and control time 
contribute to the overhead as described below. 


3.1.1 Parallel Algorithm 


The parallel algorithm consists of three major operations:
preprocessing, labeling, and merging. The original image is partitioned
into $N$ equal-size subimages, which are distributed to the PEs involved.
Each PE is assigned a subimage. Each subimage is preprocessed in a
raster scan manner, i.e., top to bottom, left to right, generating a
reduced representation of the image, i.e., the transition point
representation (TPR) [15]. The transition point pair (TPP) shown in Fig.
3.1.1 is composed of ($XL_i(k)$, $XR_i(k)$), where $XL_i(k)$ and $XR_i(k)$,
respectively, are the left point and the right point of the $i$-th region on
scan line $k$. The TPP’s are obtained by preprocessing. After
preprocessing, the labeling operation is performed by using a set of
boundary continuation conditions [13]. The local results are combined
recursively through the pseudo binary tree structure described in the
next subsection. It is necessary to merge two adjacent subimage
regions with different labels into one region with the same label. The
merging is performed first across vertical subimage borders
(vertical merging) and then across horizontal subimage borders
(horizontal merging). Finally, one PE labels regions based on the merged
TPP’s. Using the algorithm, any digital image represented in a
2-dimensional array can be labeled. Further details about the algorithms are
described in [13].
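The preprocessing step can be illustrated with a short Python sketch that extracts transition point pairs from a binary scan line. It assumes that a region on a scan line is a maximal run of 1-pixels and that $XL$ and $XR$ are inclusive column indices; the exact conventions of [15] and [13] may differ, and the function names are hypothetical.

```python
def transition_point_pairs(scan_line):
    """Return the (XL, XR) pair of every run of 1-pixels on one scan line.

    Assumption: XL and XR are the inclusive column indices of the
    leftmost and rightmost pixel of each run.
    """
    pairs = []
    start = None  # column where the current run began, if any
    for x, pixel in enumerate(scan_line):
        if pixel and start is None:
            start = x                      # a run begins
        elif not pixel and start is not None:
            pairs.append((start, x - 1))   # the run ended at the previous column
            start = None
    if start is not None:                  # run touching the right border
        pairs.append((start, len(scan_line) - 1))
    return pairs


def transition_point_representation(image):
    """Scan the image top to bottom and collect the TPPs of every line."""
    return [transition_point_pairs(row) for row in image]


line_k = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
pairs_k = transition_point_pairs(line_k)
```

For `line_k` the sketch yields three pairs, one per run, so a line of ten pixels is reduced to six coordinates. The labeling and merging steps then operate on these pairs rather than on individual pixels.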
Fig. 3.1.1 Regions and TPP’s for scan line k


3.1.2 Simulation on the FCM I 


A. Pseudo Binary Tree 


A pseudo binary tree (refer to Fig. 3.1.2 for a 4-level tree) is a 
binary tree structure which can be easily embedded into the hypercube 
topology such that a node in the hypercube may represent more than 
one node in a corresponding pseudo binary tree [14]. The pseudo binary 
tree is an efficient topology for distributing subimages and collecting 
the local results in the hypercube multiprocessors. The reason is that all 
PEs in the hypercube are utilized for the pseudo binary tree 
implementation while at most only half of PEs are utilized for a binary 
tree implementation. The pseudo binary tree can be embedded into 
not only the hypercube topology but also the FCM I and II. 


Fig. 3.1.2 The Pseudo Binary Tree


B. Embedding the Pseudo Binary Tree into the FCM I 


The pseudo binary tree can be used for the merging procedure on the
FCMs. It is not required for distributing and collecting data. For the
convenience of illustration, the 4-level pseudo binary tree shown in Fig.
3.1.2 is used. The embedding of the pseudo binary tree into the FCM I
for the merging procedure is shown in Fig. 3.1.3. The procedure is
described below.
Fig. 3.1.3 Embedding the Pseudo Binary Tree into the FCM I


i) When the switch set $S_0$ is closed, two memory modules,
($M_0$, $M_1$), become one contiguously addressable memory module,
where the parentheses represent one contiguously addressable
memory module. $PE_0$ can access the modules like one memory
module. Similarly, when $S_2$, $S_4$ and $S_6$ are closed, ($M_2$, $M_3$),
($M_4$, $M_5$) and ($M_6$, $M_7$) become three disjoint memory modules.
$PE_2$, $PE_4$ and $PE_6$ are able to merge the local results. This
procedure realizes the transition from level 3 to level 2 in the
4-level pseudo binary tree.

ii) In addition, when $S_1$ and $S_5$ are closed, $PE_0$ and $PE_4$ can merge
the local results in ($M_0$, $M_1$, $M_2$, $M_3$) and ($M_4$, $M_5$, $M_6$, $M_7$),
respectively.

iii) Finally, $PE_0$ can merge all local results in
($M_0$, $M_1$, $M_2$, $M_3$, $M_4$, $M_5$, $M_6$, $M_7$) with $S_3$ closed. In this case,
all switch sets are closed.


As shown above, there is no communication overhead (no 
message passing) that can significantly limit speed-up as on the 
hypercube multiprocessor. In addition, there is no memory conflict 
because only one PE in each partition performs the merging procedure. 
Instead there exist only switching and control times. The steps for the 
FCM II in each memory set are the same as those for the FCM I. 
Thus, the pseudo binary tree may also be embedded into the FCM II. 
In addition, general tree topologies may also be embedded into the 
FCMs. The embedding procedure for general tree topologies into the 
FCMs is similar to that for the pseudo binary tree. 
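How closing switch sets fuses memory modules into contiguously addressable partitions can be sketched as follows. The sketch assumes that switch set $S_q$ sits between modules $M_q$ and $M_{q+1}$, which is consistent with the 8-module example above; the function name is hypothetical.

```python
def memory_partitions(n_modules, closed_switch_sets):
    """Group memory modules into contiguously addressable partitions.

    Assumption: switch set q connects module q to module q + 1, so a
    partition is a maximal run of modules joined by closed switch sets.
    """
    closed = set(closed_switch_sets)
    partitions = [[0]]
    for module in range(1, n_modules):
        if (module - 1) in closed:
            partitions[-1].append(module)   # joined to its left neighbour
        else:
            partitions.append([module])     # starts a new partition
    return partitions


level_3_to_2 = memory_partitions(8, {0, 2, 4, 6})
level_2_to_1 = memory_partitions(8, {0, 2, 4, 6, 1, 5})
level_1_to_0 = memory_partitions(8, range(7))
```

The three calls reproduce the three merging steps of the 8-module example: four pairs, then two groups of four, then a single partition of eight modules. Because each partition is served by one merging PE, no two PEs address the same module at the same time.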


C. Control Algorithm for Parallel Region Labeling on the FCM I 


The control algorithm for region labeling executed in the CU is
described in Fig. 3.1.4. Consider $PE_p$. Let the binary representation of
$p$ be $(a_{n-1} \cdots a_i \cdots a_0)$ where $a_i = 0, 1$. Consider the switch set $S_q$.
Let $q$ be denoted by the binary representation
$(b_{n-1} \cdots b_i \cdots b_0)$ where $b_i = 0, 1$. $NI$ is the number of image
frames to be processed and $N$ is the number of PEs employed.


begin
    close all switch sets;
    for i ← 0 until NI - 1 do
    begin
        if (i = 0, i.e., the first frame) then
            send a signal to the IOP to load the first image frame;
        else
            send a signal to the IOP to unload output results and
            load another frame;
        wait until the CU receives the load completion
            signal from the IOP;
        open all switch sets;
        broadcast the start signals to PEs to execute tasks;
        while the number of signals received from PEs ≠ N do
            wait;
        /* Implementation of the Pseudo Binary Tree */
        NS ← N;
        for j ← 1 until log2 N do
        begin
            close the switch sets S_q, where q is all possible
                values of (b_{n-1} ... b_0) obtained with
                b_0 = ... = b_{j-2} = 1 and b_{j-1} = 0;
            send the start signals to PE_p to merge two local
                results, where p is all possible values of
                (a_{n-1} ... a_0) obtained with the j least significant
                bits set to zero, i.e., a_0 = a_1 = ... = a_{j-1} = 0;
            NS ← NS >> 1; /* one bit right shift: NS PEs merge at this level */
            while the number of signals received from PEs ≠ NS do
                wait;
        end
    end
end


Fig. 3.1.4 The Overall Control Algorithm executed in the CU 
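The index selection in the merging loop can be made concrete with a small Python sketch that lists, for each merging level $j$, the switch sets to close and the PEs that merge. The bit pattern used for the switch sets (bit $j-1$ equal to 0 and all lower bits equal to 1) is an assumption inferred from the 8-PE example in subsection 3.1.2.B, and the function name is hypothetical.

```python
def merge_schedule(n_pes):
    """Switch sets to close and PEs to start at each merging level.

    Returns a list with one (switch_sets, merging_pes) tuple per level
    j = 1 .. log2(n_pes). Switch sets are numbered 0 .. n_pes - 2.
    """
    levels = n_pes.bit_length() - 1
    if n_pes < 2 or n_pes != 1 << levels:
        raise ValueError("n_pes must be a power of two, at least 2")
    schedule = []
    for j in range(1, levels + 1):
        mask = (1 << j) - 1            # selects the j least significant bits
        pattern = (1 << (j - 1)) - 1   # bit j-1 is 0, all lower bits are 1
        switch_sets = [q for q in range(n_pes - 1) if (q & mask) == pattern]
        merging_pes = [p for p in range(n_pes) if (p & mask) == 0]
        schedule.append((switch_sets, merging_pes))
    return schedule


schedule_8 = merge_schedule(8)
```

For eight PEs the schedule closes $S_0$, $S_2$, $S_4$, $S_6$ first, then $S_1$ and $S_5$, and finally $S_3$, while the number of merging PEs halves at every level (4, 2, 1).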


D. Time for Switching and Control on the FCM I 


As described before, the switches in a set can be operated in
parallel by using the control bus, but two sets cannot be operated in
parallel. The time for operating one set of switches is represented by $t_1$.
The control signal can be issued by the CU, PEs and the IOP. The time
for sending a control signal or for broadcast is represented by $t_2$. The
control procedure described in the previous subsection is used for
estimating the time for switching and control.


To send a signal to the IOP to unload output results and load
another image frame, $t_2$ is required. After loading, the IOP sends the
completion signal to the CU, thus $t_2$ is also required. When the CU
opens all switches in all sets, $(N-1)t_1$ is needed. When the CU
broadcasts the start signal to all PEs to execute their tasks, $t_2$ is needed.
After processing, all PEs send the task completion signal to the CU, which
takes $Nt_2$. The merging procedure using the pseudo binary tree was
described in subsection 3.1.2.B. To combine two local results, the
switches in the set between two modules should be closed. Hence, $N/2$
switching times are needed for the bottom level in the pseudo binary
tree (level $\log_2 N$). Then the CU sends the start signal to $N/2$ PEs to
execute the merging procedure. After merging, $N/2$ PEs send the
completion signal to the CU. Accordingly, $N/2$ switching times and
$2(N/2)$ control signals are required. For the next bottom level (level
$\log_2 N - 1$), $N/4$ switching times and $2(N/4)$ control signals are needed.
Since the depth of the pseudo binary tree is $\log_2 N$, this procedure is
repeated $\log_2 N$ times. Thus, the total number of switching times for
the merging procedure is $N \sum_{i=1}^{\log_2 N} (1/2)^i$, and it takes
$\left[ N \sum_{i=1}^{\log_2 N} (1/2)^i \right] t_1$. The total
number of control signals for the merging procedure is
$2N \sum_{i=1}^{\log_2 N} (1/2)^i$, and it
takes $\left[ 2N \sum_{i=1}^{\log_2 N} (1/2)^i \right] t_2$.

Therefore, the total time for switching in region labeling is

$$t_{sw} = (N-1)t_1 + \left[ N \sum_{i=1}^{\log_2 N} \left(\frac{1}{2}\right)^i \right] t_1 = 2(N-1)t_1 \qquad (3.1.1)$$

and the total time for control is

$$t_{cn} = 3t_2 + Nt_2 + \left[ 2N \sum_{i=1}^{\log_2 N} \left(\frac{1}{2}\right)^i \right] t_2 = 3t_2 + Nt_2 + 2(N-1)t_2 \qquad (3.1.2).$$


As will be seen in Section 4, these times are much smaller than 
computation time. Hence, bottlenecks due to the CU, the 
communication bus and the control bus are minimal. 


3.1.3 Simulation on the FCM II


Since the FCM II has a buffering capability, another image frame 
can be sent to the currently unused memory set even before the 
previous image has been processed. In other words, we may try to 
overlap in time the data acquisition and the data transfer with the 
computation between successive images. This could improve 
performance especially when the data acquisition time and the data 
transfer time are considerably large. In contrast, there can be no overlap 
of the procedures in the FCM I. The former is referred to as an 
overlapping method while the latter as a non-overlapping method. 


The pseudo binary tree can also be embedded into the FCM II.
The control algorithm in the FCM II is similar to that of the FCM I
described in 3.1.2.C. While PEs process the data in one memory set,
the CU sends a signal to the IOP to unload output results of the
previous frame from the other set and load another frame into the set (it
takes $t_2$ for control). After the CU receives the load completion signal
from the IOP (it also takes $t_2$), the CU opens all switches in all sets (it
takes $(N-1)t_1$), and waits until PEs finish the current frame. Therefore,
the time for switching, $(N-1)t_1$, and the time for control, $2t_2$, can be
overlapped with computation time. Thus, the total switching time and
control time are reduced as follows:

$$t_{sw} = \left[ N \sum_{i=1}^{\log_2 N} \left(\frac{1}{2}\right)^i \right] t_1 = (N-1)t_1 \qquad (3.1.3)$$

$$t_{cn} = t_2 + Nt_2 + \left[ 2N \sum_{i=1}^{\log_2 N} \left(\frac{1}{2}\right)^i \right] t_2 = t_2 + Nt_2 + 2(N-1)t_2 \qquad (3.1.4).$$
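As a worked illustration of equations (3.1.1) through (3.1.4), take $N = 32$ (the size of the iPSC d5 used in this paper) and the conservative setting $t_1 = t_2 = 1$ msec adopted in Section 4. The arithmetic below only substitutes those values and is not a measured result:

```latex
\begin{aligned}
\text{FCM I:}\quad  & t_{sw} = 2(N-1)t_1 = 2 \cdot 31 \cdot 1 = 62 \text{ msec},
                    & t_{cn} &= 3t_2 + Nt_2 + 2(N-1)t_2 = 3 + 32 + 62 = 97 \text{ msec}, \\
\text{FCM II:}\quad & t_{sw} = (N-1)t_1 = 31 \cdot 1 = 31 \text{ msec},
                    & t_{cn} &= t_2 + Nt_2 + 2(N-1)t_2 = 1 + 32 + 62 = 95 \text{ msec}.
\end{aligned}
```

Under these assumptions the FCM II hides 31 msec of switching and 2 msec of control behind computation for every frame.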


3.1.4 Implementation on the Hypercube Multiprocessor 


The system is composed of a controller and a hypercube
containing 32 PEs (the binary 5-cube). The controller acquires an
input image from an input device, divides it into a set of equal-size
subimages, and sends them to the hypercube through the pseudo binary
tree described in 3.1.2.A. After distribution, every PE processes a
subimage concurrently. On completion, the local results are sent to the
higher level PEs in the pseudo binary tree for collecting and merging.
They are merged across vertical or horizontal borders through the
pseudo binary tree until a PE obtains a whole result. The total number
of vertical and horizontal merging steps is $\log_2 N$, where $N$ is the number
of PEs employed. The labeled image is sent back to the controller.
Finally, the controller sends it to an output device. As described above,
there are six steps in this procedure: acquisition of an image,
distribution of an image to PEs, parallel computation, collection of
local results, merging for boundary consistency, and transferring the output
result to an output device. There exist controller-to-PE communications
and PE-to-PE communications for distribution and collection. We used
the modified singlecast scheme for distribution of data, in
which the controller distributes a set of subimages to PEs on a certain
level in a pseudo binary tree [14]. Every PE which receives a
subimage divides it again into two subimages and sends one subimage to
its child PE in the pseudo binary tree, recursively, until all leaf PEs
receive their subimages.


3.2 Median Filtering 


Median filtering is a neighborhood operation, which transforms
the value of each pixel to a new value calculated from its neighboring
pixels (the median of a 3 x 3 window is used). Convolution, edge
detection, and smoothing are other examples of neighborhood
operations. To avoid unnecessary communication due to the data
partitioning, the overlapped partitioning method shown in Fig. 3.2.1 is
used. In this case, merging time ($t_{mg}$) can be avoided. Therefore, the
pseudo binary tree, which is used only for the merging algorithm in
region labeling, is not needed to implement median filtering on the
FCMs. However, it is needed to distribute and collect data on the
hypercube multiprocessor. The control algorithms of the FCMs are
simpler than those of region labeling because there is no merging
procedure, namely, no pseudo binary tree. Since the control steps for
the pseudo binary tree are not required, the total times for switching
and control on the FCM I are derived from equations (3.1.1) and (3.1.2)

$$t_{sw} = (N-1)t_1 \qquad (3.2.1)$$

$$t_{cn} = 3t_2 + Nt_2 \qquad (3.2.2).$$

As described in 3.1.3, $(N-1)t_1$ in $t_{sw}$ and $2t_2$ in $t_{cn}$ can be
overlapped with $t_{cp}$ on the FCM II. Therefore, on the FCM II the total
time for switching can be totally overlapped, and the total time for
control is

$$t_{cn} = t_2 + Nt_2 \qquad (3.2.3).$$


Fig. 3.2.1 The Overlapped Partitioning Method of Image (subimages for $PE_i$ and $PE_{i+1}$)
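The effect of overlapped partitioning can be sketched in Python as follows. Each simulated PE receives its own rows plus one extra row from each neighbouring subimage, filters its strip independently, and keeps only its own rows, so no merging step is needed. The partitioning into horizontal strips, the unchanged copying of image border pixels, and the function names are assumptions made for illustration; the paper's figure may partition differently.

```python
def median3x3(image):
    """3 x 3 median filter; pixels on the border of `image` are copied unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]   # middle of the nine values
    return out


def partitioned_median3x3(image, n_pes):
    """Filter the image strip by strip, as n_pes independent PEs would."""
    height = len(image)
    if height % n_pes != 0:
        raise ValueError("image height must be divisible by n_pes")
    rows_per_pe = height // n_pes
    result = []
    for pe in range(n_pes):
        core_start = pe * rows_per_pe
        core_end = core_start + rows_per_pe
        halo_start = max(core_start - 1, 0)      # one overlapping row above
        halo_end = min(core_end + 1, height)     # one overlapping row below
        strip = image[halo_start:halo_end]       # subimage held by this PE
        filtered = median3x3(strip)              # purely local computation
        offset = core_start - halo_start
        result.extend(filtered[offset:offset + rows_per_pe])
    return result


# Small deterministic test image (8 rows, 6 columns), hypothetical data.
test_image = [[(7 * y + 3 * x * x) % 11 for x in range(6)] for y in range(8)]
```

Filtering `test_image` with one, two, four or eight simulated PEs gives the same result as filtering the whole image at once, which is why the merging time can be dropped for this kind of neighborhood operation.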


4. EXPERIMENTAL RESULTS AND DISCUSSION


Parallel algorithms for region labeling and median filtering have
been implemented on the iPSC and simulated on the FCMs by using
the iPSC. The simulation results are based on the following
conservative assumptions. Firstly, for simplicity and sufficiency, $t_1$ and
$t_2$ are assumed to be 1 msec. Secondly, the data acquisition time and
the data transfer time are assumed to be 40 msec for all machines.
Lastly, the PEs of the FCMs are assumed to have the same computing
capabilities as those of the iPSC. In practice, $t_1$ and $t_2$ are usually
much smaller than 1 msec (on the order of nanoseconds), and the data acquisition
time and the data transfer time for a 256 x 256 binary image
(64 Kbytes) are less than 40 msec [16].


4.1 Region Labeling 


User-generated binary images (256 x 256) were used to test all 
aspects of the algorithm. The images have a variety of regions. A data 
set which resembles a mesh containing a variety of region types is used 
for performance evaluation. 


To evaluate the performance of the FCMs, each time component, 
the total time, and the speed-up are listed in Table 4.1.1. The speed-up 
is plotted in Fig. 4.1.1. As shown in this figure, the speed-up of the 
FCMs is better than that of the hypercube multiprocessor. The reason 
is that communication overhead increases as N increases on the 
hypercube multiprocessor. On the FCMs, there is no communication 
overhead that increases with N, only marginal overhead in switching 
and control. Even with the conservative assumptions, the switching and 
control times on the FCMs are much smaller than computation time. 
Hence, bottlenecks due to the CU, the communication bus and the 
control bus are minimal. 


The overlapping method (FCM II) achieves better performance 
than the non-overlapping method (FCM I) as expected. The FCM II 
can reduce even the data acquisition time and the data transfer time. If 
these times are large, the difference in performance is large. Even 
though the difference is small, the FCM II has more applicability as 
described in subsection 2.2.3. 


Even with no communication overhead, the speed-up is not 
linearly proportional to N because there are some necessary tasks that 
are independent of N. For instance, the amount of time to label regions 
based on merged TPP’s is fixed because it must be done by a single 
PE. This behavior can be observed in Table 4.1.1. 


4.2 Median Filtering 


The measured performance figures are tabulated in Table 4.2.1. 
The speed-up is plotted in Fig. 4.2.1. As shown in this figure, 
significant speed-up is achieved. The speed-up is better than that of 
region labeling. The reason is that there is no merging procedure, and
computation time is larger than that for region labeling. In other words,
the ratio of computation time to parallel processing overhead in median
filtering is larger than the ratio for region labeling. The computation
time ($t_{cp}$) decreases linearly as $N$ increases. The speed-up of the FCMs
is approximately linearly proportional to N because there is no 
communication overhead, no merging procedure, and no necessary task 
which must be done by a single PE. Accordingly, as N increases, more 
speed-up may be achieved. For many neighborhood operations, such 
significant speed-up may be expected. 


5. CONCLUSIONS 


In this paper, two Flexibly (Tightly/Loosely) Coupled 
Multiprocessors for image processing are proposed. These 
multiprocessors alleviate the disadvantages of existing multiprocessors, 
such as communication overhead due to message passing in loosely 
coupled multiprocessors, memory contention due to shared memory in 
tightly coupled multiprocessors, and overhead due to inefficient loading
and unloading of data.


Some features of the FCMs are superficially similar to
those of other architectures, such as the VS [7], the SM3 [8] and the
MP/C [9]. However, there exist several significant differences. The
FCMs differ considerably from the VS in that the latter is a goal-oriented
architecture, being functionally dedicated, with inhomogeneous
processors and memory modules. Other significant differences between
the FCMs and the VS architecture are in the bus structures, switches,
memory addressing scheme, and I/O scheme.


Fig. 4.1.1 Comparison of Speed-up for Region Labeling


In the FCMs, a set of adjacent memory modules can be formed
into one contiguously addressable module, while this feature is not
found in the SM3 and the MP/C. In the worst case, if the processor in
the SM3 or the MP/C tries to repeatedly access data stored in two
alternate memory modules, then module switching in the SM3 or
computing an effective address by the switch controller in the MP/C
must precede each memory access. In image processing, this situation
occurs often. One example is the merging of subimages using
boundary consistency. If two adjacent subimages in two different
memory modules are to be combined into one subimage using boundary
consistency, the alternate memory modules need to be repeatedly
accessed for every boundary pixel processed by the MP/C and the SM3.
In contrast, in the FCMs only one switching time is needed to connect
the memory modules.


The input/output processor is directly connected to the variable 
space memory in the FCMs. This scheme can minimize the overhead to 
load and unload data, which may be significant in the SM3 and the 
MP/C. In each partition of the MP/C, only one processor can be active, 
while in the FCMs, any PE in any partition can be active. Only one 
type of communication (processor-to-memory, no direct processor-to- 
processor communication) is supported in the MP/C, while three 
different types of communication are supported in the FCMs. In 
addition, the switch controller in the MP/C and the three different 
control buses with switches in the SM3 are complicated. The FCMs 
use simple switches in one control bus. The SM3 is a multicomputer, 
in which each node is an independent computer system with its own 
secondary storage device, but the FCMs are multiprocessors. 


Parallel algorithms for region labeling and median filtering have 
been simulated on the proposed architectures by using the iPSC. The 
performance of the FCMs shows remarkable improvement over the 
existing hypercube multiprocessor. The FCMs can also be used for 
more general tasks. Image processing, MSIMD processing, SIMD 
processing, pipelined pseudoparallel algorithms including pipelined 
algorithms and tree-structured algorithms are a few examples. The 
FCMs are highly suited when data locality is guaranteed. 


Since the CU, the communication bus, and the control bus are 
mainly used for control, but not for the exchange of data, bottlenecks 
due to these components are minimal as described in subsection 4.1. In 
addition, the architectures are simple and modularized, and the control 
strategy is straightforward. Hence, the FCMs have good scalability.
The granularities of the proposed architectures may be considered in the 
range from coarse to fine. The applications for more general tasks will 
be investigated in future research. 